|
Do you find the Support WebCast transcripts helpful? Let us know!
Microsoft Support WebCast
Troubleshooting SMS 2.0 site server Performance Problems
August 28, 2001
Note This document is based on the original spoken Support WebCast transcript. It has been edited for clarity.
Bruce Jones: The presentation today is not aimed at trying to help you size your hardware for your site servers. It's not aimed at helping you configure your hardware correctly. The assumption is that you've followed all the recommendations in the SMS Administrator's Guide, the SMS Resource Guide, and the SMS Sizing white paper, and that you've purchased and configured your site servers correctly.
In fact, this presentation is assuming that you're a fairly advanced user and you know how to configure hardware, you know how to use Performance Monitor, you know how to run Transact-SQL queries and so forth. What this presentation is specifically aimed at, is trying to help you figure out why a site server that was running efficiently at one point is now having some performance problems.
Now, Alliance Support Professionals work with one or perhaps a few customers only. As a result, we often will work with the same customer on the same site server over the course of a few months or perhaps even years. I believe we have gathered a perspective and a set of techniques for troubleshooting site server performance problems that will be helpful to you. This presentation is about sharing with you that perspective, and those tricks and techniques that we use with our customers. I hope you find it very helpful.
Let's move on to our next presentation outline slides (2 and 3). We're going to cover 10 crucial aspects of site server performance troubleshooting today. At the end, what we're going to do is walk through an actual case study that will tie them all together in a practical way for you. We're going to talk about measuring performance in meaningful ways, maintenance tasks that are vital to maintaining good performance, how to reduce the amount of data that is collected and stored, how to ensure database integrity to have a fast, efficient database, and some crucial Microsoft® SQL Server™ tuning tips that can have a huge impact. We're also going to talk about the sometimes-mysterious topic of query tuning. We're going to look at how to reduce the impact of reporting on your site server. We're going to look at moving services off the site server, and we're going to spend just a little bit of time, not much, on hardware sizing and configuration.
Then of course, we're going to move into the case study. Now, the case study, what I want to do there is have you look over my shoulder as a customer and I work together on some performance problems that we were having about a year ago. We worked on a central site server and some regional servers, and hopefully this will show you that it's almost never one thing that impacts a site server. It's usually a combination of factors.
Now, let's move on to the first topic that we're going to cover today, measuring performance of the SMS Executive (slide 4). When I run into a server that's experiencing performance problems, the symptoms reported are usually one of the following: Either the SMS Executive is hung, or it's using excessive resources, or the Inboxes are badly backlogged.
Let's look at how to measure the performance of the SMS Executive in a meaningful way. If you were sitting in front of your site server, you might hit CTRL+ALT+DELETE and select Task Manager. You'd then look at the SMS Executive CPU utilization and look at the Memory utilization. You'd ask yourself, is either unusually high or unusually low? Now sometimes you can be fooled by this because high CPU utilization, around 70%, 80% or more, can actually be a good thing. It means that SMS is doing work.
So what we need to do is determine whether or not the level of work that you see is appropriate. The best way to do this is to compare it to a baseline. If you've been monitoring your site server over time, you'll have a pretty good feel that your Executive should be running at perhaps using 20 megabytes (MB) of RAM, something like that, and the CPU might be bouncing somewhere between 15% and 45%. It varies site to site so, you can't take my baseline numbers. You'll have to establish your own.
The other way you can establish whether the activity is appropriate is you can stop and start the SMS Executive and see if memory utilization goes down drastically. Memory utilization going down drastically can be a sign that you might have a memory leak, and in the case of duplicate GUIDs, Dataloader may be leaking memory.
So we've determined using Task Manager whether or not the SMS Executive might be having problems by looking at the CPU utilization and the Memory usage. Of course, the SMS Executive is a parent thread to a bunch of child threads, which correspond to the child services that SMS has.
So moving on to the next slide (5), what we need to do is find out, if there is an SMS Executive problem, which of the child threads are guilty of creating the problem. To measure performance of a particular, specific child thread, you must use Performance Monitor. Now, I've given you all the information you need, on the slide, to use Performance Monitor to identify and measure the performance of each child thread. Again, the level of this presentation assumes that you already have Performance Monitor skills.
Now you could get really bogged down in PerfMon. You could start creating Excel spreadsheets and PivotTables®, and you could create WMI scripts that do all kinds of fancy things when Dataloader exceeds X percent or Y amount of time and so on. While you're doing that though, I'm going to do something else. I'm going to do a trick that I use in trying to determine which thread is guilty very quickly.
What I do is I load the SMS Performance Monitor counters, and I stop each child service in the SMS Service Manager one at a time. I'm sure you're familiar that the SMS Service Manager is available in the Tools section of the SMS Administrative Console. What I do is I watch the activity in SMS Performance Monitor, and I watch the lines or those threads that are pegged to 100%. I stop each child service in the SMS Service Manager one at a time, and I watch for that line that's pegged at 100% to drop down to 0. When a line drops down to 0, I've found the guilty thread, and perhaps if you're busy trying to do the Excel thing, you're still writing PivotTables and all that kind of thing, I've already found the problem. It's a real quick trick using Performance Monitor and the SMS Service Manager while watching that activity and waiting for it to drop to 0.
Now, I've given you the Knowledge Base article, Q234508, which shows you how to do this. So you don't have to remember it from the presentation, look up that Knowledge Base article.
Now, there are rare instances when you cannot connect to a server. The performance is so impacted by extreme memory utilization or extreme CPU utilization that Performance Monitor will not connect in a timely manner, or the SMS Service Manager will not connect in a timely manner. On your slides, it says to set each thread's startup type to Manual in the registry, start the Executive, and start each thread. I thought about that, and I made an adjustment to the slides. The last bullet should say, "Rename each child DLL as identified in the SMS Executive log, and start the SMS Executive." {Editor's note: The PowerPoint slides have been corrected.}
In the next batch of slides, you'll see there's another tactic there that often our Critical Problem Resolution group uses, and that is to go into the SMS Executive log, identify each of the DLLs associated with those child threads, to rename them after stopping the Executive, and then, name them back one at a time starting the Executive to find which child service is badly impacting the box. Again, I would not recommend that you do that for a first step; that you only do that if the box is impacted so quickly that you cannot use any other method to determine which child thread is responsible.
Once you've determined which thread is responsible for a performance problem, whether it's Despooler or Dataloader or something like, you need to determine whether this is in fact a problem or whether this is activity that's appropriate. On the next slide (6), we will determine whether the activity is appropriate using a couple of methods.
The first thing that we would do is we would review the Status system for the component. Let's say we've determined, by stopping each of the threads, that Despooler is using a high amount of the CPU utilization time. What we would do first is we would go to the Status system and look at the Despooler component to see if Despooler is just working away on jobs as it should, or if it's having some sort of problem.
Now, if it's a child service that accesses the database, like Dataloader or Data Discovery Manager, it might be helpful to see which queries are running, if any of those queries are taking a long time or running up the processor and perhaps causing a memory leak. So you might turn on SQL Tracing, per the Knowledge Base article, Q241001, and then monitor the associated log. Now, you need to be very careful not to leave it turned on, because you may get up to a 15% performance hit just from the SQL Tracing activity. So we would check the Status system of that component. We would review the associated log of that component, and then we might check the associated Inbox of that component. If Dataloader is the problem, how many MIFs are backed up in the Inbox? Does it always spike the processor on a particular machine's MIF? Is there a corrupted MIF, a 0-byte MIF? Check the associated Inbox and look for things that are anomalous, that are not normal, in order to determine what is going on.
Now, we've measured the performance of the SMS Executive in a meaningful way, we've taken a look at the SMS Executive child threads and seen how to determine whether activity is appropriate or not. Let's move on from the Executive to the database, and see a meaningful way to measure database-processing efficiency on the next slide (7).
We don't have to get deep into SQL to find some meaningful things to measure. It might be tempting to use the SMS Performance Monitor MIF counters or the DDR counters to try to attempt to measure how efficient your database's processing efficiency is, but I want to convince you here to rely first on two SQL counters instead. The first SQL counter that you should look at is the cache hit ratio. The second one is transactions per second.
These two counters will tell you some crucial things. First, do you know if your database is cached in RAM? Is SQL processing all those MIFs in RAM, where it's good and fast, or is SQL processing all those MIFs by hitting the hard drive every time? Well, here's how you find out.
If you go to SQL Query Analyzer, which is in your SQL Program Group, you can put the focus of SQL Query Analyzer onto the SMS database on your site server and run the command DBCC SQLPERF. You'll get a result, and if your cache hit ratio is less than 99.7%, you may need to consider adding RAM. Now, be careful. If your site is very, very inactive, it's a small site and so on, it may be that your database just hasn't had a chance to fully load into RAM yet, but to observe this on a busy site, you can stop and start SQL Server and then the Executive, while monitoring the cache hit ratio. You'll see it rise to 99% or greater, as the database loads into RAM.
The next Counter that you might want to look at is the Transactions per Second counter. Now, the reason the Transactions per Second counter is an effective way of measuring database processing efficiency, is because a transaction per second is the only consistent measure that you have.
Let's say instead, you were trying to use the MIF counters. So how many MIFs per second isn't very meaningful over a short period of time, because some MIFs are very small and have maybe 10 transactions in them, and some MIFs, like full-inventory MIFs or resyncs, are maybe very large and have hundreds of transactions in them. So how many of those MIFs that get processed per second won't give you the most efficient measurement of how efficient your database is processing.
Instead, we should look at how many transactions per second are running, and higher is better. Now, if you have run this counter, you will now need to interpret the results. If you have a low number of transactions per second, you will have to look in and see what the underlying problem is and fix it. If you have a high number of transactions per second, but you're still experiencing backlogs, you'll need to reduce the amount of data generated or consider adding hardware. I'll give you some numbers for what might be appropriate, but every site is different depending on the hardware, the load, and the configuration.
Moving on to the next slide (8), we're going to look at Inbox processing efficiency. We've looked at measuring the Executive. We've looked at measuring how efficient the database is at processing. Now, we're going to look at the Inboxes themselves and talk about how to use them to measure performance.
While accurate enough for a general sense, backlogs are a good way to make sure that you don't have a performance problem. If you're seeing backlogs, you may have a performance problem that you need to address. Watch for a sudden occurrence of backlogs or a sudden change in trend in the MIF or DDR Performance Monitor counters. You can also watch for how old the oldest files are. So you could say something like, I have a backlog of files three days old in this particular Inbox. So watch for those backlogs, watch for slow processing, and while you're in there, look for stale files, MIFs that are six months old, for example, and 0-byte files. We'll talk more about 0-byte files in a later slide. You also might check and see whether your virus scanners are scanning that Inbox.
In addition to Performance Monitor and of course Explorer, there are some third-party tools that can help you monitor the Inboxes. I'm sure if you go up to some of the Web sites that you folks frequent as SMS customers, you will find reference to them there.
Once you have found a backlog in a particular Inbox, you need to use that measurement to figure out what the underlying cause behind the problem is. Again, we can simply look at the associated component's Status System messages, see whether or not it's processing properly, the things are moving as they should. For example, you might look in the Dataloader component Status System and find that a lot of MIFs are being processed from a site attachment that happened the day before, or you might find that you have a lot of resyncs, and that might be something that you need to address.
If the Status System doesn't give you enough detail, you can then move to the associated log. You can look at the date/time stamp of a particular status message that is of interest to you, and then you can open the associated log, look for that date/time stamp and find more activity surrounding that particular problem.
One of my favorite tools for troubleshooting backlogs is Notepad. I remember a particular case a while back, where I was looking at a backlog in the Despooler\Receive directory, and I didn't happen to have the SMS Administrator console available to me. I had simply had the Despooler\Receive directory zipped and sent to me. So I opened the files up in Notepad, and although most of the file is in binary, enough of it is in plain English that you can read, and I saw that these were all resyncs. So we began troubleshooting from the perspective of looking for what was generating the resyncs. In that particular case, it had to do with a service pack installation causing resyncs on those clients, and we simply allowed time for those resyncs to process.
We've looked at measuring performance in a number of ways. Now, before we go on to some of the other things that we can do, I'd like to talk about some maintenance tasks that I have found over the years that customers often don't get around to. This is on the next slide (9), maintenance tasks that can severely affect your performance. Some of these maintenance tasks just get forgotten sometimes, and they can cause some serious performance problems. They're tricky because often there are no smoking guns that point to them.
The first maintenance task that I have found over the years that isn't often done, is disk defragmentation on the directory of the SMS Inboxes. Think about it for a minute, what's the basic function of SMS? The basic function of SMS can be thought of as taking a whole bunch of tiny files, creating them, renaming them, moving them, and deleting them. Now, what causes serious disk fragmentation? Well, taking a whole bunch of tiny files and creating them, renaming them, moving them, and deleting them. Can I ask you a question? When was the last time you defragmented the drive that has the SMS Inboxes on it? Defragmenting your drives regularly can increase performance significantly. One customer I work with defrags their drives once a month.
The second maintenance task that is often not considered that has a very serious impact on site server performance, has to do with virus scanners. What do virus scanners do with a whole bunch of tiny files when you create them, rename them, read them, move them, and delete them? As you know, they open them up, and they battle with SMS for access and control of those files. I have seen virus scanners cause serious delays in a variety of SMS child processes attempting to process their files. I've seen it in the Status System with the .sdf files. I've seen it with DDRs, .tmp files, if they're being created, and so on.
What you can do is properly configure your virus scanner for your Inboxes. Now, virus scanners can be configured in one of two ways usually. You can either exclude directories, or you can exclude extensions. If you have the option to exclude directories, you might consider excluding the Inboxes altogether to keep your virus scanner from contending for those files.
If your virus scanner doesn't have that option, you can exclude extensions. The extensions for most of the files that SMS processes are listed in "Appendix A" of the SMS Resource Guide. There are a few extensions over the years that I have found that are not in the resource guide. I have listed them for you on the slide: the .tok, .tmp, and .chg files. Another file that I found just this morning that isn't listed in the appendix is .not files. You might consider excluding any of those extensions.
Now, those of you familiar with excluding virus scanner extensions will know that there's a limited number, in some products, of extensions that you can exclude. Which ones then should you exclude, if you have to choose between some of them? I have found, over the years, that the .tmp files, the DDR files, the MIF files, and the .sdf files are particularly important to exclude. You can watch performance in the logs of the processing of these files, and you can make a decision just based on trial and error which ones are impacting your site the most.
Let's move on to the next slide (10) to find some other maintenance tasks that you should be doing that can severely affect performance.
0-byte files can very seriously impact your site. They can cause file-processing delays. I've seen them spin the CPU utilization up to 100%. How can you tell if you have this problem? You can use the Advanced Search feature of Explorer to search on files less than one kilobyte (KB). Now, I've found the hard way that you can't search on files that are 0-bytes. Why do you think that is? These files are not really, truly 0-bytes. They may have a binary character or two in them making them slightly larger than 0 bytes. So in order to find them, what I do is search on files that are less than 1 KB. Let's say you do this, and you discover you have a variety of 0-byte files. Now, what do you do?
First off, I wanted to tell you, please don't just delete them. There are certain files that are 0-byte files by design, and you could cause yourself some serious problems. One example is the .tok files that lock Inboxes for a time while some component is processing the Inbox. You can safely delete 0-byte DDRs, SDFs, MIFs and some .tmp files in those same directories, but make sure they're aged several hours old or longer. If you have any doubt, if you've never done this before, please call PSS before you delete any files. Please, don't delete the Outbox.cpb file, the Sitestat.dat file, or any .tok file. Please be very, very careful with this. You might even just move those files to another directory and watch the logs and follow up that way.
We've talked about 0-byte files in the Inboxes. There's another maintenance task that you perhaps might not have done. Collection Evaluator has a cycle that is a lot more frequent than perhaps you need. You might consider setting the Collection Evaluator Cycle to 12 hours. Look at your business needs, think about how frequently you need your collections to be updated regarding their membership, and set your Collection Evaluator Cycle appropriately.
The other thing that you might consider doing, if you find that you have backlogs in your Despooler directory, is to increase the number of despooler threads. On the slide, I've given you the registry location where you can increase the number of despooler threads. In a variety of cases, especially with resync jobs, I have found that, when they're backed up, increasing them by a couple of threads can help reduce that backlog significantly and much more quickly.
We've looked at some maintenance tasks that can severely affect performance. I hope that you will go back and think about the maintenance tasks that you are or aren't performing on your site, whether you're doing disk defragmentation, whether you've addressed the virus scanner contending for you Inbox files, whether you've looked for aged files, 0-byte files in your Inboxes, and the other settings that I've suggested that you look at.
Let's assume that you've measured your performance, and you have found that everything on your SMS Executive and the child threads is appropriate. It's all processing appropriate things. You've looked through these maintenance tasks, and you've found that everything is done there, as much as you can do it. You've deleted your 0-byte files. You've made sure that your virus scanners aren't contending for files and so on, and you just need to get a little bit more performance out of your machine. Perhaps you've added sites or clients to your hierarchy.
At this point, on the next slide (11), I'd like to just briefly suggest that one way to reduce this performance problem is to reduce the amount of data you have collected or stored. So on the slide entitled, "Reduce the Data Collected/Stored," let's look at how you can collect less data less often.
First, consider collecting data less often. Look at the schedules for discovery and hardware and software inventory, and think about reducing them. Let's do some simple math here together. If you're currently collecting hardware inventory every day, and you change this to maybe once a week, by what factor have you cut down the data you're collecting? Well, by seven. You can cut your MIF flow to one-seventh of what it was simply by changing that setting. Often, customers have not looked through the schedules throughout their entire inventory, and they're surprised at how many of their sites are configured to collect more data, more frequently than what they thought.
Now, the second thing that you should consider is collecting less data, not less frequently but less data itself. You can use the Mofman utility to reduce how many properties you collect for hardware inventory. Ask yourself, do you really need to know who has a sound card?
For software inventory, you can reduce the number of extensions you've defined. DLLs create a particularly large amount of data. If you have a need to find just one DLL, let me tell you a little trick that we use sometimes. We send a package to copy that DLL and copy it over and rename it to an extension that we are in fact performing software inventory on. We will pick up the properties of that DLL and add it to inventory.
Let's say we've gone through our settings, and we've decided to collect less data less often. The other option that we have to improve performance, in the area of the amount of data that we have, is to reduce the amount of data collected and stored by deleting more data more often, and this is on the next slide (12). By deleting more data more often, you can keep your database lean and clean. Smaller databases are faster than larger databases.
Ask yourself for a minute, how long do you really need to keep certain aged information? Do you really need 90 days worth? You can cut your table size significantly by configuring your delete aged tasks to 30 days.
There are a variety of ways also to reduce the number of status messages. Now, we'll see in a minute that status messages can pile up in your database and cause that table to grow significantly as well. There are a variety of ways to reduce your status messages. You can reduce the types of messages that are forwarded up the hierarchy. You can reduce the types of messages that are actually written into the database, and you can reduce the period of time status messages are kept in the database. Now, you can look in the manual to find these settings.
I just wanted to point out that when you leave these settings at their defaults, you're advocating your design needs to what is considered an average site. No site is truly average. Everyone has their own specific needs, and by abdicating that decision, you certainly are having some performance impact regarding your status messages and how that table is growing and how it's performing. This is probably one of those tasks that you haven't gotten around to. I would encourage you to look at configuring your status messages and again configuring your tasks. Consider setting your delete aged perhaps from 90 days down to 30 days.
On the next slide (13), "Reduce the Data Collected/Stored," I want to show you how to truly measure how big your database is growing. Now, it might be tempting to use a SELECT query to simply return the number of rows, but that won't tell you how large the table really is. I want to ask you for a minute, how large is each table in your database? Do you know which ones might be growing out of control, because row count won't tell the whole story?
The query I've given you on the slide will give you the actual size of each table in your database. I would ask you to consider running it frequently and to watch for a sudden increase in table size or the sudden appearance of a new table that you weren't counting on or you weren't expecting; that could be from a NOIDMIF that you hadn't approved in your environment. You can then take some time to figure out why this table is growing and if necessary, take the appropriate action.
The next slide (14) shows you some of the tables that tend to get large in the SMS database. Some tables that you might want to watch are Driver History, Logical Disk History, Operating System History, and LC Services History there. Do you notice that some of these tables are Hardware Inventory History tables? If you keep less hardware inventory history, you could reduce the size of these tables. Notice also that the Status Message table also tends to get large. This is why it is important to configure how many status messages are kept and forwarded.
I hope that you will take the time to start monitoring table size. I hope you will use the information to make some intelligent decisions as to how to best keep your need for information balanced with your need for performance.
Let's move on to the next slide (15), ensuring database integrity, increasing efficiency. Databases that have had their integrity compromised run slowly. Has your database been compromised, do you know? Does your database have one of the common problems that I've listed on this slide?
Let's look at duplicate GUIDs. If your database has duplicate GUIDs, then you have trouble. Duplicate GUIDs are caused by imaging the SMS client files or registry, which is not supported. The GUID serves as sort of a primary key of the database, not really, but it's helpful to think of it that way. If you have a compromised primary key, all kinds of things happen, including high CPU utilization, high memory utilization, and machines replacing each other in collections. You may be watching a collection, and machines will disappear, other ones will take their places. One particularly nasty symptom of this problem is software distributions go to the wrong machines.
Now, a complete solution is available at the location on the slide (http://www.microsoft.com/smsmgmt/techdetails/newuid.asp) . There's a WebCast available for you, but this is one of the very first things that I would do, if I were administering a site, to ensure optimum performance on a site server. Almost everyone suffers from this, whether they think imaging is happening in their environment or not. I would ask you, have you checked your site for this problem?
Now, the second thing that you can do to ensure database integrity and increase the efficiency of your database, is to run the database consistency checks. All SQL databases, whether they're the SMS databases or not, need to be maintained, and one of the ways you maintain them is by the command DBCC, or database consistency check. If you don't do this, it could be possible that your database could become corrupted over time, and even back up successfully, but then would fail to restore, and you could be in a very serious situation should you need that data.
On the slide, I have given you a command that you should consider adding to your Smsbkup.ctl file. It's not documented anywhere, other than the slide here, and what this command will do is automate running the DBCC CHECKDB and DBCC CHECKCATALOG commands when your site system backup tasks are run. Now, if you are going to add this to your Smsbkup.ctl file, be sure to cut and paste it, and add it in that way. If you type it in for some reason, make sure that you use the proper single quote. The single quote used is not the normal single quote that you would use, and if you use the normal single quote, this command will not run properly. {Editor's note: This refers to using a "straight" rather than "curly" apostrophe, which is also called a single quotation mark).}
After you run these commands, and if you're using the Query Analyzer, you would simply put the focus on the SMS database and run DBCC CHECKDB or DBCC CHECKCATALOG ,and read the results. After you run these commands, look in the SQL error log and check for errors. Be sure the SMS Executive is stopped and all other activities in the SMS database are stopped, or you may get spurious errors.
The last trick for ensuring that your database has integrity and to make sure that its efficiency is proper, is the reference to the Knowledge Base article, Q200653. I'm using Conv20.exe to check database integrity. Let me ask you a question. Do you know that all the indexes on all your tables are in fact in place? On occasion, indexes get corrupted or they get dropped. When an index gets corrupted or gets dropped, it causes a serious performance problem depending on the table. Do you know which ones are supposed to be there and are they, in fact, there? This article shows you how to use the Conv20.exe utility to check to make sure all your indexes are there and to make sure that the schema is proper in your database.
Just as an aside note, this utility comes with your services packs or your product version upgrades, and it is used to convert the schema of the database to the next version of the product. It's safe to run, if you use the version that's appropriate to your database. You wouldn't want to do it too often, but it's safe to run, and the article will give you more information.
While we're talking about the database and ensuring database integrity, let's move on to a different part of looking at SQL Server. Let's look at tuning SQL Server on the next slide (16). When I find a site server that has performance problems, I look for two things before I do anything else. The first thing I check for is duplicate GUIDs. The second thing I check for is the SQL dynamic memory setting.
Now, duplicate GUIDs can cause the SMS Executive to use huge amounts of memory. Improper SQL dynamic memory settings can cause the SQL Server to use huge amounts of memory, often all the available memory on the box. You don't have to get into Performance Monitor or anything else to look at this. You can stop this problem pretty quickly with Task Manager.
If you do not restrain SQL dynamic memory, it can quickly grab all the memory on the machine, and what happens is you starve the operating system and everything else. You can read the slide. It gives you a general formula for how to configure this setting. Now, this isn't an official formula, but it's rather something I picked up from a friend in Microsoft Consulting Services, and it seems to work pretty well. You need to adjust and test in your own environment.
Now, we've looked at tuning SQL dynamic memory. Let's move on to the next slide (17), and look at some other settings that you might want to adjust. Even though this is not a session on configuring your hardware, I do want to mention the importance of properly separating your database, your log, your master database, and your SMS directory. I'll refer to some documentation later in the slides that will help you know how to separate these properly and when to separate these properly, but I wanted to mention it here because it's an important SQL Server tuning factor.
Now, moving to the next bullet on the slide, every so often on one of our internal SMS e-mail aliases, I see someone ask a question, and the question they ask is, "How do I set processor affinity to optimize SQL Server performance for SMS?" The product group answers these questions, and they always answer this particular question in the same way, do not do it. SQL Server is very efficient in managing which processor handles which task, and that's what processor affinity is. It allows you to set a particular thread to have affinity with a particular processor. SQL Server is very efficient in managing which processor handles which task, and you are likely to cause more problems than you resolve, so don't do that.
If you must play with some SQL settings, I've given you some settings here that you should look at: locks, open objects, user connections; make sure those are all set to 0, which is autogrow. As far as any other settings go, the basic rule of thumb is, look in the SQL error log. If the SQL error log shows a particular problem, then you may need to adjust that setting. One setting that comes up rarely, but does come up in the SQL error log refers to the number of maxworkerthreads. If you see that the maxworkerthreads need attention in the SQL error log, you can bump that setting up a little bit, but the general rule of thumb is, if you don't see a specific error in the SQL error logs, you don't need to adjust anything else. The locks, the open objects, the user connections, all those things will show up in the SQL error log. Set your device configuration properly, set your SQL dynamic memory properly, set those particular settings we adjusted as they showed up in the SQL error logs, and you should be fine regarding tuning SQL Server.
Now, before we leave SQL Server altogether, I'd like to move on to the next slide (18) and talk about query tuning. Query tuning is kind of a forgotten SQL performance factor. For those of us whose primary job is not SQL Server or SQL Server development, we might not even think about query tuning as being important. That's too bad, because it is.
I took a class in optimizing SQL performance awhile back, and it surprised me and amazed that most of the class was not about going in, looking at all the settings, tweaking those settings, and getting it right. Most of the class was about how to write efficient and fast queries. I want to ask you, "How many of you have thought about the queries you have created in SMS this way? How many of you have tested different ways to write your queries to see which ones run faster?"
I've listed some symptoms of poorly tuned queries: the results come back slowly or you don't get all the results back; in other words, the Results pane is not completely populated with machines that you know it should, or you run the query multiple times and get different results back each time. Another symptom is, the SMS Administrator console may all of a sudden run very slowly while that particular query is running, or you might be monitoring your tempdb size in Performance Monitor or another way, and you discover that it grows astronomically during your query, and then comes back down to where it should be. All these are signs of a poorly designed query.
We're not going to get into a lot of detail about this. I'm just going to tell you basically what causes the problem very briefly, and a trick as to how to redesign a query so that you don't have this kind of performance problem.
The common cause for a query impacting your system is too many joins on array properties. What do we mean by that? Array properties are properties where there is more than one value for a particular object. Let's take IP addresses, for example. You can have a single network card that has multiple IP addresses associated with it. That is an array value.
If you have too many joins in your query on these kinds of array properties, you can get huge result sets back. I won't go into the math, but there is some math behind that.
What I do want to do is tell you a trick about how to identify these array values, and also how to adjust your queries so they don't contain array values. The simple way to identify array values is to go into SQL Enterprise Manager and look at the tables and see which properties have drop-down boxes. The properties that have drop-down boxes are array values.
Now, look in your query. Your query is written in WQL rather than in Transact-SQL. In order to translate your query from WQL to Transact-SQL, so that you can reference the table and see what's going on as far as an array value, is to run the query and then very quickly look in the Smsprov.log. In that log, you will find the WQL query translated into an SQL query, and you'll see the tables that are being queried. Compare that table to the WQL table or the WQL that is in your query that you have written in the interface, and if any of those contain array properties and they're slow, you need to redesign your query or consider redesigning your query.
Without getting into query design, a simple trick for that is to simply continue to remove items from your query until the query is fast. Then, add them back in until you find the problem. Then, consider a different way, find a different way, to write that query so that you're not joining on an array property. A subSELECT from another query is a good way, sometimes, of getting around that problem, so enough of query tuning.
Let's go on to look at another aspect of SQL Server or SQL that might be impacting your site. Let's look at reporting and how reporting might impact your site, and reducing the impact of reporting against your sites. This is on the next slide (19), reducing impact to reporting.
To reduce the impact of reporting on your site server, you can do two things. First, you can move the data to another optional location off of the site server. The second thing you can do is implement reporting through your views.
I want to ask you, "Is reporting causing deadlocks on your central site?" It's not uncommon for a company that's been running SMS on a large central site server for a while and they've been running it, and everything's been fine, but that word gets out about all this great information that's available in the database. More and more people start running reports against the database and then, things start to slow down. This is because SQL is busy running all those poorly written queries that join on array values, perhaps for one reason, but it is also because of deadlocking.
Deadlocking is caused when two attempts are made to access the same records at the same time. Heavy reporting can cause deadlocking and impact performance. So how do you know if this is a problem on your site? There's a very simple way: Look in the SQL error log. It will show the deadlock errors.
Now, there's a very simply way to get around this problem. Simply move your data off your site server before you do reports on it. One option might be to do a nightly dump of the database and load it up on another server and run your reports there. You need to assess your business need for how timely the information needs to be for reporting versus the impact it has on your site server.
The second thing that you can do to increase the efficiency of reporting, is to eliminate the provider layer with views. We include it as of SP2. We included views that will allow you to directly query SQL. These views are faster than going through the provider layer, and they're safer, because rather than trying to figure out our schema, and rather than trying to report directly against our table structure, you're reporting against the views, and queries and reports will survive, for the most part, schema updates. The views will continue to stay consistent as the service pack gets applied, whereas if you're trying to report directly against the tables, they may not. So please report only against the views.
Reducing the impact to reporting. Consider moving your data to another database as a dedicated reporting database. Consider reporting against the views for faster queries.
After you have done all this, and your site server perhaps is still having trouble keeping up, you know that your hardware is where it's (the problem) going to be. You've collected less data less often. You've deleted more data more often. You've troubleshot all of the problems, and perhaps you're growing. You have merged new sites in, and you need to just get a little bit more performance off your site server. You should consider moving services off the site server.
On the next slide (20), when we have eliminated those bad things that are impacting our performance and we're starting to move services off, the first thing that you should perhaps avoid doing is avoid moving SQL Server. Sometimes, people think that separating SQL Server from the SMS site server will help them out. This has been, in fact, a recommendation and a recommended configuration in past versions of SMS; however, testing has confirmed that you should avoid running SQL Server on a separate machine.
Now, there are a couple of reasons for that. The first is, there's a lot of data being transferred or a lot of query information going back and forth between the site server and the SQL Server. You may bog down your network sending all that data between the SMS site server and the SQL Server. Second, the most likely bottleneck in today's high-speed systems will be your network anyway. So you may be better off to keep SQL Server and SMS on the same box.
If for some reason you have to do this, there are some suggestions. First, use a dedicated network connection. By this I mean, put an extra card in your site server, an extra card in your SQL Server, and put a hub between them. Make sure that nothing else is on that particular segment, so that you're not loading down other machines or impacting their network connectivity.
The second thing that you ought do, if you have to do that, is to set the default network library on the SQL client on both boxes to TCP/IP sockets. That will improve your performance as well. That's in the SQL program group on both the SMS machine and the SQL Server, set the default network library to TCP/IP sockets.
Now, having addressed that, let's talk about some services that you should perhaps consider moving off the site server on the next slide (21), "CAPs, DPs, and Senders." If your site is optimized performance and is still pretty busy, you can, and perhaps should, offload some services.
You can move the CAP off the site server, and I would do this anyway, as a matter of good practice. While we're talking about CAPs and performance, I want to throw in a trick for you. Some folks have found that a CAP-to-client ratio of about 1 to 500 works well. So you might consider adding a CAP for every 500 clients. This has more to do with how efficient your clients are in accessing the CAPs, but as long as we were talking about moving CAPs off the site server, you might think about adding additional CAPs as well.
The other thing that you can move off the site server is the distribution point. Again, I would do this anyway.
You can also move the sender off the site server. Did you know that you can configure multiple senders? If your site-to-site communications are a little sluggish or not timely, you should look into this. What's more, you can move the senders on to other machines to optimize performance. I don't want to go into great detail on this now. For more information about how to do this, see Chapter 8 in the SMS Administrator's Guide.
When you've done all the troubleshooting, and you've removed whatever problems you find through monitoring the Executive, the Inboxes, and so on, when you've managed to increase the efficiency of the database by running the database consistency checks, removing duplicate GUIDs, making sure SQL is tuned properly, and moving all your services off your site server than you can, when you've done all those things, and you're just at the end, you need to do something else. It might be time to consider adding hardware.
On the next slide (22), "Hardware Sizing and Configuration," I've given you some recommended reading. The System's Management Server 2.0 Resource Guide, Chapter 2, has some excellent suggestions regarding site server hardware sizing. The SMS Server Administrator's Guide also talks about some hardware issues. An excellent resource is the "System's Management Server 2.0 Server Sizing in an Organization" white paper (http://www.microsoft.com/smsmgmt/deployment/smssizing.asp). An aside about that particular white paper is, if you really want to get further into performance monitoring for performance tuning and looking at what's going on at that level, this white paper has additional information about how to do that.
Let's move to the next slide (23), "Hardware Sizing and Configuration." I'd like to talk about a few things briefly regarding hardware before you just buy more processors, more RAM, or something like that without really looking at it first.
If you're going to upgrade your hardware, I would ask you to consider adding disk subsystems first. Remember what we said SMS was? SMS creates a whole bunch of tiny files and moves them, renames, reads, and deletes them; and it's also a database, so disk access is very important.
If you're adding hardware, do your homework. Look at the recommended reading and do some of the testing, but you might find, with some of that testing, that some of the suggested disk subsystems that you would want to add, would be to put your database on RAID 5, to mirror your log devices, and then follow the other guidelines and recommended reading.
The quote on the slide is from one of our product group people who specializes in performance, and what he said was, "They don't make a single disk system fast enough to support large SMS sites." I want to ask you, "Is your site large, and how many disk subsystems do you have, especially on your central site?" For more specifics, please see the recommended reading and follow the recommendations there.
On the next slide (24), we're going to look at CPUs. Often, I hear customers state that they're planning to improve performance by adding processors. They do this as a first step, and they're disappointed that they didn't get a significant return on that investment. So I'd like to give you some guidelines on this slide for when to consider adding processors.
Consider adding processors if, and only if, the RAM is adequate. Now, remember your cache hit ratio. You need to make sure that the database is being hit at a cache rather than off the disk. So is RAM adequate? Is the disk contention relieved by separate disk subsystems? If you can say yes about those things, you might consider adding processors.
The quote on the slide is again from the product group. "There are probably less than 10 central sites in existence that might need to consider Datacenter, SQL Enterprise edition, and more than 4 processors." Now, I hesitated to include that quote, because I remember a few years ago that a large SMS site was some 10,000 to 20,000 machines. Now, a large site is over 100,000 machines, and I think that the sites that were being referred to there are considerably larger than that. In a year or two, who's to say what a large site might be at that point?
So that's all that we're going to do regarding talking about hardware sizing and configuration. In fact, we've covered all 10 aspects of measuring performance and looking at how to increase database efficiency, Inbox processing efficiency, and some of the maintenance tasks that you need to be doing to ensure that your site is running properly. What I'd like to do now is have you think about all 10 aspects and consider that with these 10 things, there are usually bits and pieces of each of them that may be impacting any one particular site server.
That's why I included, at the end, a case study. What I want to do is, I want to lead you through some notes that I took about a year ago at a customer site when we were troubleshooting a central site server and some regional site servers that were having performance problems; and to show you, in a practical way, how these 10 different aspects that we've looked at work together to solve a problem in a real-world scenario.
So, let's move to the case study (slide 25) and look at the problem. When we first began looking at this hierarchy, the central and the regional site servers were just unresponsive. We tried to access them with the SMS Administrator console, and it would either simply hang, or it would take forever to connect to the database and to open different objects. The SMS Executive was hanging on a regular basis or using huge amounts of CPU. The Inboxes were backlogged. Both the MIF directories and the DDR directories had much more than 100,000 files in them each.
Now, this site server hardware consisted of quad-processor machines with 2 GB of memory. So, these were pretty good machines. In the whole site, in the hierarchy as a whole, there were about 60,000 client machines at that time, and the central site had 60,000 in the Discovery table.
As we started looking at this problem, the first thing we did was to start Task Manager on the central site server to determine if there was a problem with CPU or memory utilization on that server. You might remember that from the first performance measurement slide.
The results of the test with Task Manager are on the next slide (26). We were trying to determine if the CPU or the memory usage was guilty of impacting the performance of that box. Task Manager immediately showed that SQL was using a gigabyte of memory. The SMS Executive was using 821 MB, and the SMS Executive was pegging the CPU to 100%. Obviously, we could tell immediately just from Task Manager that the system was RAM-starved and CPU-bound.
With the SQL memory usage at a gigabyte, what do you think the first thing I did was? The first thing we did was, we addressed that SQL RAM over-utilization. We went into Enterprise Manager, and we set the maximum memory setting down to 500 MB. It had been set at a gigabyte, and that was too much. It didn't need that much.
The second thing we did, after addressing the SQL memory usage was we ran the Transact-SQL SELECT query, that's in the duplicate GUID solution I referred to, to look for duplicate GUIDs, because we looked at that SMS Executive memory usage at 821 MB, and it was about maybe 800 MB too much, and duplicate GUIDs can cause that. So let's take a look for those. So we ran that Transact-SQL query, and we found thousands and thousands and thousands of duplicate GUIDs.
Now, the customer, at this point, we were talking about it, and he didn't want to clean up the duplicates right then. He wanted to send out a package to clean them up on the clients, and if we cleaned them right then, we wouldn't have any targeting information. So I said, let's just leave those there and see what else might be impacting the site server.
So what we did was, we moved on to that next measurement of performance, which is looking at child services of the SMS Executive to see which ones are causing that SMS Executive to peg at 100%. We started Performance Monitor. On the next slide (27), we see the results. Here, we're trying to determine the guilty SMS child service. We loaded all the SMS Executive Performance Monitor counters, and then we started the SMS Service Manager.
Now, we cycled through each of the services in Performance Monitor and using the highlight button, we discovered that two of them were pegging the CPU to 100%. We needed to find out which two SMS child services were those two services that were pegging the CPU to 100%, which two were guilty.
What we did, rather than trying to correlate the thread ID to the logs and so on as explained in the article, a little trick is to stop each of those SMS child services using the SMS Service Manager tool, while watching the Performance Monitor chart. When we stopped Dataloader, and I suspected Dataloader because of the duplicate GUID problem, one of those two threads that had been right up at 100% immediately dropped to 0. So, we knew that Dataloader was one of the guilty services.
We stopped some other threads, and then we got to Data Discovery Manager. When we stopped that thread in the SMS Service Manager, the other chart line that had been at 100%, immediately dropped to 0. So we found out which SMS child services were guilty of causing this problem: Dataloader and Data Discovery Manager.
Now, before we fixed anything and before we changed anything, we wanted to get a baseline to know how each change impacted the performance, how much of a factor each change we made was to impacting performance.
So we wrote down our RAM settings and our CPU settings and so on, but we also ran the SQL Performance Monitor counter, transactions per second. Remember, transactions per second tell us how efficiently data is being loaded into the database. The results of this query were 2.3 transactions per second. This is an extremely terrible rate, by the way.
We decided to start with Dataloader. We opened the Dataloader log and took a look at Dataloader to see what was going on. We have our baseline. We're going to now begin to address each guilty child service and see if we can fix whatever is going on that's wrong.
So, on our next slide (28), we viewed the Dataloader log with Smstrace. Smstrace is one of the support tools that you can download from our Web site. We viewed Dataloader log with SQL Tracing enabled. Now, I gave you the article previously on how to enable SQL Tracing. We noted in the Smstrace log, while watching the Smstrace log and watching Performance Monitor at the same time that the CPU spikes occurred while updating history on specific machines.
We looked for a minute at our history, our aged tasks, and we reconfigured our aged tasks to keep history, instead of 90 days, we configured it down to 30 days. We noticed a minor improvement. In fact, we got, instead of 2.3 transactions per second, we now had 5 transactions per second. Now, 5 transactions per second from 2.3. transactions per second, that might be 100% improvement, but it's still terrible.
One of the other things, while we were watching the Smstrace utility, while we were watching Dataloader log, we noticed large stalls when Dataloader was creating status messages. These stalls went anywhere from 20 seconds to 2 to 3 minutes, which is too long. We thought, what could be contending? Why could these status messages be causing a stall?
I thought, I wonder if they've excluded their virus scanner from this directory. We went and looked at the virus scanner configuration, and we discovered that the .tmp and the .svf files had not been excluded. In fact, it was the .tmp files that were not listed in the appendix, that were causing the problem. So we excluded the virus scanner from the .tmp and the .svf files. Again, our transactions at that point had been up to 5 per second, and the stall went away.
As we move on to the next slide (29), we see at this point that we decided to clean up the duplicate GUIDs. We had done everything we could for Dataloader from the perspective of looking at the logs, looking at the files that were being processed and so on, and it was just time to clean up the duplicate GUIDs. So we cleaned up the duplicate GUIDs using the Transact-SQL script.
CPU utilization, which had been pegged at 100%, immediately dropped to between 30% and 80%. The SMS Executive RAM, which had been at 821 MB, dropped down to something like 20 MB. The interesting thing I thought was, the Transactions per Second counter was now spiking between 70 and 90. I think it was obvious, at this point, that we had found the major impact of the server, duplicate GUIDs.
However, if you remember that we had also determined that Data Discovery Manager was spiking the CPU to 100% occasionally, so we weren't going to stop there. We decided to look into that, as we see on the next slide (30).
We addressed the guilty SMS child service, Discovery Data Manager. We monitored Discovery Data Manager in Performance Monitor. Remember, we knew which thread it was because we had stopped and started the service in the SMS Service Manager.
While we monitored Data Discovery Manager, we looked in the DDM box. We found a 0-byte file that was several weeks old. We deleted that file. Now remember, be careful about deleting 0-byte files. Please, contact PSS before you do that, if you're not sure. We deleted the file, and DDM settled down to spikes less than 50%.
Now, we wanted significant CPU utilization there, because we had over 100,000 DDRs backlogged; however, the 100% utilization was inappropriate, so we had needed to address it, we did, and we got it down to where it should be.
Now, we got thinking, look, we found a problem in one of the Inboxes. We decided to look in all the Inboxes, look for 0-byte files and other things, as we see on the next slide (31).
When we got to the Despooler\Receive directory, we found a backlog of 108,000 files. At this time, we took a little break, and I was remote. So the customer zipped up the files and sent them to me. I opened them up in my favorite tool, Notepad, and discovered they were mostly resyncs, and again from the duplicate GUIDs.
We didn't want to just delete those. That's not a good idea. So what we did was, we increased the Despooler threads from the default of 3 to 8 in the registry. I referred to that registry setting in one of the previous slides.
Immediately, Despooler increased in CPU utilization. Now, that's good because we want Despooler to be working harder and more. So it's a good thing when CPU utilization goes up in some cases. The Despooler log showed faster processing now and multiple threads.
So we went on to the other Inboxes, as the next slide (32) shows. In the other Inboxes, we found some 0-bytes files. We found old UDC, RPT, and RPL files. We backed them up and then deleted them. Again, please don't do that without checking with PSS, if you're not certain. I also want to mention, there are a couple of files, the Outbox.cpb file and the Sitestat.dat file and some others, .tok files, that are 0-bytes by design, so be very, very careful. Call PSS if any doubt.
The one thing that we had discovered also in doing this, that I didn't tell you, is that when I initially had the customer search for 0-byte files, he said, "I don't have any." Then, he zipped up the directory and sent them to me. I found them, and I said how did you search on the 0-byte files? What happened was, he had used Explorer and the Advanced Search feature and searched on 0-bytes rather than 1-byte files. Remember that trick, use search for files less than 1 KB rather than 0-byte files, because many of these files are not truly 0-bytes.
After reviewing the rest of the Inboxes, we went back to Performance Monitor to see what the rest of the SMS child services were doing. We had pretty much cleaned up Dataloader for the time being. We cleaned up Data Discovery Manager. We cleaned up some Inboxes and 0-byte files and some aged files. We had increased Despooler.
We wanted to see if any of the other child services were doing anything inappropriate and if we could fix it. So how did we do this? We watched Performance Monitor and we watched Smstrace and the corresponding log at the same time, as we see on the next slide (33).
What we were doing was reviewing each of the child services to see if the activity that they were performing was appropriate. The first child service that we found that seemed to have some activity that was inappropriate was Sender. While we were watching the log, the Sender was spiking the CPU every three seconds to 100%. The interesting thing was that in the Sender log, at the same time, it was showing attempts to connect the sites that were no longer attached to the hierarchy and had, in fact, been decommissioned and had gone away.
What we did, there was a bunch of jobs scheduled and running for those sites, we ran preinst /deljob for those sites. Preinst is documented in the resource guide, and if you're not absolutely sure how to use the utility or when to use the utility, please call PSS prior to doing something like that. We did a preinst /delsite for those sites that were no longer attached to the hierarchy.
Then, we saw the Sender log was still spiking every three seconds, not as much, but still a little bit, and the logs showed that it was spiking on some sites where there were no addresses. There were errors saying, we have no address for this site that we're trying to send this job to. We went into the interface. We created the addresses for those sites, and now the Sender CPU utilization dropped down to 0 percent.
Having done that, we went through all the rest of the child services that showed CPU activity, but we found nothing else that looked inappropriate. At that point, we decided we'd pretty much found the source of our performance problems, and we'd resolved them. We had fixed the SQL dynamic memory problem. We had deleted duplicate GUIDs from the database. Now, we hadn't cleaned the clients yet, that still had to be done, but we had aleviated the immediate problem, the duplicate GUIDs. We fixed a problem with the virus scanner configured to scan files that SMS was contending with. We found a variety of 0-byte files. We cleaned them up, and we found that Sender was misconfigured, and we corrected the problem. After we had done all of this, and by the way, this took place over the course of a couple of days, after we had done all this, what was performance like? Let's move on to the next slide (34).
The SMS Administrator console was responsive. It was quick, it was fast, and it was accurate. The CPU utilization was now running between 15% and 30%, sometimes up to 80% processing those backlogs, very brief spikes to 100%, but that was okay. The RAM utilization was normal. We saw no more growth. We monitored it in Task Manager very quickly and just watched the memory usage, and we had no more growth. Our transactions per second were between 30 and 70, which is a good rate in my experience. Now, other sites may have different rates that are optimal for them, but from my experience, that's been pretty good, 30 to 70, with peaks to 90. Our backlogs got processed in about two days. We had hundreds of thousands of files in the DDR and the MIF directory, and those got processed in about two days. About a week later, the customer moved the log to a separate drive, and we increased performance by an additional 20%, and that was evident in the Transactions Per Second counter. Obviously, things went well here.
I hope the case study has helped you get a real good sense about how to go about troubleshooting site server performance problems. I also hope that you'll go back to your own production environment and check for some of the common causes of problems that we've gone over today. Look for duplicate GUIDs. Look for misconfigured SQL dynamic memory. Look to see if your virus scanner is properly configured. Monitor your Inboxes for aged files and 0-byte files. I trust that you'll all be able now to use Task Manager and Performance Monitor, the SMS logs, and the various SQL queries that we looked at today, to properly troubleshoot any of the problems you may have in the future.
I really want to thank you for your attention. I hope this was helpful. Now, I'd like to turn the session back over to our moderator.
Jason Bennett: Thank you so much for that presentation, Bruce. Okay, just a couple of quick notes before we move on to the Q&A portion of the Support WebCast.
If you'd like to have a copy of the slides, be sure you download the files from the Web site. Also, if you would like to review the content again, we have the on-demand streaming media available. All of this content will be available from the Past Support WebCast page. To access information on all upcoming Support WebCasts and the archived content from all past WebCasts, an easy to remember URL is http://support.microsoft.com/webcasts/.
The Q&A portion of the Support WebCast is intended to encourage further discussion of the Support WebCast topic. One-on-one product support issues are outside the scope of this Support WebCast. If you do need technical assistance, please submit an incident on the Web, or call Microsoft Product Support Services and speak to a Support Professional.
Joining us today on the Q&A is also Wally Mead. Wally has been with Microsoft for nine years. He started in the training organization, where he delivered training on LAN Manager. He began working with SMS in 1994 and has been involved in the development and delivery of all SMS training courses for Microsoft. He now works for the SMS product group in the Early Adopter Program, and provides training for EAP members.
We have gotten quite a few questions already in the queue. The first question, How do we know that PerfMon is a 15% performance hit? Is this running PerfMon locally or on a remote machine?
Bruce: The question really is how do we know that figure is accurate, and the only way, that figure was determined pretty much by experience and common knowledge of support and the engineers that have worked with Performance Monitor in the past. I don't have any hard and fast data for you. I would like to mention, with the SMS Performance Monitor counters, occasionally they can cause memory utilization problems, run over long periods of time. So I don't have a hard and fast documentation for that, but it is pretty much common experience amongst the support engineers.
Jason: As far as excluding folders from the virus scan on site servers, what kind of risks are associated with doing this? Given the recent virus issues, unless we can demonstrate very low risk, our security folks would balk.
Bruce: The risks are fairly low. Most of the files that you are looking for in a virus attack have to do with a .exe or those files that, when executed, perform actions. The SMS files that I have listed for you are passive files. In other words, if you tried to act on them, if you double-clicked on them or something like that, they don't do anything to the box. The worst that could happen is that a file could become corrupted. Let's say a MIF could be corrupted or something like that, and it would become a bad MIF or it would not be recognized by Dataloader. {Editor's note: SMS has a special folder for MIF files it can't process, which is called Badmifs.} Having said that, you need to do some due diligence on your own part as far as asking yourself, "If this file were infected or corrupted, what would be the impact to the organization" The only thing that I can see, in that these are passive files, is that it would corrupt the file to the point where SMS could not process them.
Jason: Okay. Why are WBEM Query Language queries so much slower than SQL queries? Is there anything that can tune WMI, or should we expect lower performance from it?
Bruce: WQL queries intentionally go through the provider. The provider is one more layer of processing between the query and the database. The advantage of using WQL is that discreet security can be applied to WQL, and it abstracts it from the database so that, should you be writing queries directly to the database and the schema changes, you would have to rewrite those reports or rewrite whatever it is that you were writing to query the data. That is why WQL queries run slower. They have to be translated to Transact-SQL and they have to run through an additional layer. The advantages are discreet security and the schema is preserved with product upgrades.
What can be done to make sure that your queries run efficiently? Make sure you're on the latest version of WMI. Make sure that your MMC Console is at the latest version, 1.2; that's especially problematic on Windows NT® 4.0. If you have NT 4.0 and you've never upgraded MMC, then really it's the MMC interface that is perhaps bogging down your query. You can test this by running the query locally on the site server versus running it on a remote machine, and seeing if there's a difference. If there's a difference, then it's the network or the MMC connectivity issue, but again, that is why we provide the views for reporting. If you need to get faster reporting, you can report from the views. Again, you lose that discreet ability to set security that the provider enables you to have.
Jason: Okay. We have several tables in the SQL database that were created by the Altiris Notification Server Export. How do we delete all the tables out of the SMS database without corrupting the SMS SQL table structure?
Bruce: There are some resource kit utilities, Delgrp is one of them, that enable you to remove tables that were created by NOIDMIFs either that you created that you no longer want anymore or from other products that create NOIDMIFs and ended up in your hierarchy. If you look at the Delgrp in the resource guide and also search on the Knowledge Base on Delgrp.exe. You'll find a number of articles that tell you how to run this utility. One caveat on that, if you're running SQL 2000, you might want to contact Support. There are issues regarding how you have security set up and whether or not the utility will run correctly. So in order to answer the question succinctly, there are some resource kit utilities that allow you to remove these tables, Delgrp.exe being one of them.
Jason: Is there a tip to help us understand and translate information from the client's log files?
Bruce: The client logging can get overwhelming because there are so many different things that are happening. It can get overwhelming regarding what to look at. For each set of logs, there are tricks regarding what to look for and what not to look for. I guess the best way is to refer you generally to the flow charts in the SMS Administrator's Guide and the SMS Resource Guide. Look at the flow of the particular process that you're interested in. If you're interested in, for example, software distribution and how the client reads the offer files on the CAP and which logs those should appear in and what should happen at that point, then the flow charts will show you which logs to focus on.
I will give you one tip that has been particularly helpful to me. The Status System will often give you milestone or error events that pinpoint a point in time that a particular condition occurred that you're interested in. Say for example, you're on the client and you're attempting to install a client, and the installation fails. You get "access denied" or you get some other problem with the installation. In the Status System, you may get a status message back regarding that. That status message will have a particular date and time or a particular status message ID in it. You can then go back to the client log and look at the appropriate client log that you discovered from the flow, and look at that specific date/time or query for that status message ID, because often that's logged in the log. That will show you exactly in the log where you are at and enable you to look above and below in the log to see where you are at in the flow.
We could spend probably several hours talking about which logs to look at and what special things to query for and so on. I just want to encourage you to think about some generic things you might query on such as, if you're doing software distribution, query on the package name or query on the program name or query on the advertisement ID. If you're looking at the server logs and you're trying to find a reference to a particular machine, query on the machine name or some specific item like that, and it will help you greatly look through those logs and make sense of them. Again, I'd point you back to the flows, the system flows and the client flows as documented in the administrator's guide and in the resource guide.
Jason: Okay, this user actually had several questions about the log files. So I want to apologize in advance if you've already answered this. Where can we get information on what each log file will do on the client workstation?
Bruce: In the system flows in the resource guide and the manual. The other thing is one more note on that is that it's impossible to document every logical tree that might particularly occur in a configuration. So what I often do is, I will re-create the same issue successfully and save those logs out, and then compare those logs to my unsuccessful logs and the flows to find out where we went wrong, what changed, what's different. That will often isolate the problem.
Wally Mead: One more note on that before we leave, this is Wally talking. If you want more information on log files, you can try attending the Microsoft Official Curriculum course, the MOC course, it's number 828. In that course, we spend a lot of time going in and looking at log files on the server end, on the client end, and comparing the log file references to the Status System. So in that class, you spend a lot of time looking at the Status System to see which log, like Bruce mentioned the milestones that are logged there, the date and time, and the instance of when something initiated or completed, and then you look at the log files.
So what you're usually seeing in that class is, as Bruce mentioned, capturing successful log files. You're looking at log files showing you how a process should be completed and what the steps are in the process. Then, you can use that as a reference to go back to your own site when you're trying to troubleshoot some problems. In the class, we saw log files that looked this way, and now my log file is showing something here, whether it designates the word "error", which it may not designate, or just some other condition, at least you have a reference to compare something good versus something bad. As Bruce also mentioned, the flow charts in the administrator's guide and the resource guide are very good for that as well.
Bruce: I want to add one more thing. I just realized it. Wally sparked a thought in my mind. As far as client logs go, I wish I could ask the person who asked the question whether they were referring to software distribution or to client installation. One thing that is helpful to know is that, if you are installing clients by remote installation, a lot of the information that you need to troubleshoot the problem is in the Client Configuration Manager log on the site server, rather than on the client. Its partner log is the WN_remote.log (Windows Networking Remote client installation) on the client. So if you're trying to troubleshoot a problem with client installation, you need to look at both the CCM log, search on the machine name, and the WN_remote.log on the client to get the full picture.
Jason: Great. I have three primary sites and one central site. One of my daily tasks is to SQL query the central site to pull the site code, type flag, and serial number, and then run the same query on the primary site. On the central site, the type flag may be set to two or there may be multiple entries for that site. If the serial number is not the same on the primary and the central site-to-site, communication is degraded. Can you explain why this may be happening, or how this information is used for site-to-site communication from child to parent?
Bruce: Was the first part of the question referring to the Site Control file serial number?
Jason: Let's see.
Bruce: I think what is being asked is, we have sites that are sending transactional data back and forth via the Site Control file. Each of those Site Control files has an identification number or a serial number that identifies which transaction is happening. Now, if those transaction IDs get out of synchronization, the site communication is impeded.
The thing that pops to my mind immediately with that is that there was an improper backup and recovery that occurred, where the transaction IDs were not reset properly. If that is the case, what will happen is that, until such time as the transaction ID counter in the database increments to where it before the backup and recovery happened, the site-to-site communications will fail.
If you're in that scenario where you have that problem, be sure to look up on our http://www.microsoft.com/smsmgmt/, find the Site Recovery Wizard and go through that. It will show you how to properly set those transaction IDs. If you need more immediate help on that, please call PSS. They're aware of how to get that communication restored, how to set those transaction IDs so your site-to-site communication will be restored. Wally, can you think of anything else that the question might have been referring to?
Wally: Yes, he is referencing the Site Control file, and Bruce is correct in that every transaction that goes from a child site to a parent site or parent down to a child, may be given a transaction ID and serialized. Then, the receiving site will verify that that transaction ID falls within the boundaries it's supposed to. In other words, I'm stating I'm not going to receive a Site Control file unless there's this serial number. So if I receive one that's got a serial number less that what I expect, then I just throw it away and ignore it.
So you may not be seeing things that you're expecting to see, because they're invalid serial numbers, and that can be because of delay in processing or you did a recovery process and didn't reset those serial numbers properly. So, what I would do is, like Bruce mentioned, I would go up to the Recovery Web site, http://www.microsoft.com/smsmgmt/techdetails/recovery/, and then look at that to see what it shows as far as how you adjust your serial numbers.
Outside that, it's just a matter of monitoring in the log files to see if you're having those issues, but I've never heard of anybody doing those queries on a daily basis to pull off the serial numbers and so on, because you do get different serial numbers, and there are different type flags for deltas versus actual Site Control files that do get passed up normally as well.
Bruce: I just want to put a plug in for doing backup and recovery properly. In a hierarchy, you cannot simply restore a backup and expect communications to work properly. You must follow the steps in the wizard, because SMS communicates site-to-site in a transactional manner and keeps track of those transaction IDs. So you must follow the wizard and you must be using the SMS Backup Task to get a properly restorable site. You can't just stop the server and back it up in a hierarchy.
Jason: Any chance of publishing a cross-reference between SQL view schema and WQL tables used in MMC queries?
Bruce: The policy of the product group has been no. The reason for that, I believe, has been many experiences where customers have written, and invested incredible amounts of time and money, into writing reports or doing other things that directly access the database and then we implement a schema change, either through a hotfix, a service pack, or a product upgrade, that renders all that work, all that time, and all that money invalid, and causes a great deal of pain. Now, I can say, and I did mentioned a trick for a specific instance, you can look at the Smsprov.log, and it will show the translation of a WQL query to a Transact-SQL query.
Now, also, there is the statement that the schema is not published for the SMS database and will not be. I'd like to qualify that and say the WQL schema is published. It's in the resource guide, and there are additional items regarding the schema and how to use it in the Platform SDK for SMS that allow you to do some very powerful things with the database. Any customer using the SDK to perform operations on their site directly has all the information they need in the SDK to properly access the schema of the database, in a supported manner, and accomplish what they need to accomplish.
Jason: I'd also like to take a moment to solicit some feedback from our audience. If you do have suggestions for topics for future Support WebCasts, you can send any feedback about the WebCast that you've seen so far, the interface that we're using, anything along those lines, go ahead and send it to feedback@microsoft.com, and just put "Support WebCasts" in the subject line, and that will get routed to us. We do take your feedback into account when planning these WebCasts and just improving the program overall.
So next question, Is there a release date yet for Topaz?
Bruce: Topaz is releasing quarter one/quarter two of 2002. There's no release date in that it's a quality-driven release, and it will depend on the feedback from the Early Adopter Program as to when the product is ready to release, but it's looking like quarter one/quarter two has been what has been announced by the product group in public forums.
Wally: Still, as Bruce mentioned, it's still driven by quality. So even though we might be stating quarter one/quarter two, it's going to depend upon what feedback we get from the early adopters. So if they state something is not ready go, then we will not hold to quarter one/quarter two. We will wait for a good, quality release.
Jason: Are there any issues running DBCC commands, such as CHECKDB or CHECKALLOC, from SQL commands of the SMS Admin console? Also, it's my understanding that in SQL 2000, running the DBCC CHECKDB actually performs CHECKALLOC and CHECKTABLE. Is this correct?
Bruce: Yes. If you query on the Transact-SQL Help, you'll find that there are some differences between how DBCC operates in the 6.0/6.5 SQL releases and 7.0. Some things are in fact rolled into the DBCC CHECKDB for SQL 2000. I think it's also referenced in the Administrator's Guide or the SMS Resource Guide. So please check your Transact-SQL Online Help to find out which ones those are in. I don't remember off the top of my head. I'd have to go back and look myself. The two commands I gave you, I believe, are valid for SQL 7.0.
Jason: I did want to make a comment. We have gotten a couple of really one-on-one support questions, which are outside the scope of the Product Support WebCasts. So, if you've got a really specific question about your configuration, you really ought to call into Product Support Services and speak to a Support Professional. You can also go on the Web and submit an incident at http://support.microsoft.com/.
Bruce: I'd like to mention too that if you have a Technical Account Manager and you would desire to have a workshop done on SMS performance tuning that's based on pretty much the same content as what was given here today, but with a lot more hands-on, please contact your Technical Account Manager to get more information.
Jason: Okay. What defrag utilities are available for NT 4.0 site servers?
Bruce: I believe Windows® 2000 is the first time we implemented an integrated defragmentation utility. You would need to use a third-party disk defragmenter for NT 4.0. I can't give you any specific recommendations regarding a particular third-party product, but I'm sure there's a policy or a specific site license or a particular product that your large corporation uses, and you might check with your IT department to find out what the appropriate one is.
Jason: Are there any known SMS customers using SAN storage for SMS site servers?
Bruce: That has been a question that has come up recently, in that fiber channel and SAN, storage area networks, are very high performance. People have been tempted to put the database, and so on, on storage area networks.
The issue has been, is this supportable? That question has gone around, because Microsoft will say something is supportable or supported, if it has been tested in that configuration, and SMS currently has not been tested with a storage area network configuration; however, there are customers doing it. It's very fast.
The support limit comes down to, what if you have a problem? What if there's some scenario that we haven't tested that causes a problem with SMS? At that point, our QFE (quick fix engineering) folks would say, that is an unsupported configuration, in that it's untested.
That does not mean that you can't submit an incident to Product Support and ask for a request for a design change request to handle whatever the problem is. It's no guarantee that that design change request will be accomplished, but that's what we mean by limited support or perhaps unsupported, that it's untested; therefore, we cannot state to you unequivocally that you will not have problems; however, the customers are doing it.
Jason: How safe is it to use different applications, such as database providers, to query the SMS database instead of the SMS Admin console? WBEM inquiries are much slower than the SQL queries.
Bruce: If you're doing selects only, if you're only reading data, you're not going to hurt anything; however, we really, really want you to use the views that were provided as of SP2. The safety factor is that when you start querying SQL directly, as far as selecting data, you lose the ability to set specific security that the WMI provider gives you. The second thing that you lose is the ability to set specific security; however, there's no reason why you can't use the views, why you can't report from the views. In fact, we encourage you to do so, if the security is not a factor for you.
I remember what was the second thing I was going to say was. The real danger is you'll put hours and hours and hours of work into reporting off of the schema that is versioned. In other words, the way the schema exists right now in the database, is directly related to the particular hot fix version and service pack version of the database. If you invest hours and hours trying to report off of the SMS database and we change the schema in the next service pack or the next hotfix version of a particular component like Dataloader or the Hardware Inventory Agent for example, you're going to lose all that work. We don't want you to have to suffer that. We would much rather you use the views.
Jason: After we delete a site in SMS, why doesn't SMS automatically do cleanup steps like deleting all associated jobs that are pummeling resources trying to find the missing site?
Bruce: I think what I would suggest at that point is there is an e-mail alias that Development monitors called smswish@microsoft.com. Development really does monitor that alias and likes that kind of feedback, and I would strongly encourage you to be very specific in your feedback and say, "This is what I'm seeing. This is what I think the product should do. Would you folks please consider this for the next version of the product?"
Wally: Another point on that is actually SMS does try to do that. If you look at Hierarchy Manager, and when Hierarchy Manager's log file says, "this site is detaching," it actually logs a statement in there that it's trying to clean up all jobs pending to that child site. So we do try to do that. We just don't do a very good job of it unfortunately, and that's why you have to go out and use the deljob, the preinst /deljob command that Bruce mentioned earlier in his presentation. So we do try to do some cleanup after it. We just don't do a very good job right now, and they do know about it. Like Bruce mentioned, you can go ahead and e-mail that in to smswish@microsoft.com, but the development staff and program managers are well aware of the fact that we don't do a good job with that. So it's something they're trying to work on.
Jason: Okay. In the case study results, you mentioned moved logs. Which logs did you move, the SQL logs?
Bruce: Yes. You have the database and you have the database logs. In that the logs are where transactions are written and then those transactions are then, when you have a BEGIN TRANSACTION, the transactional statements themselves, and then an END TRANSACTION, and then that transaction is written to the database; a lot of activity is writing to that log. If you move that log onto another device, it separates the disk access contention. So the logs are a high-reward move item. So yes, it was the logs for the SQL database.
Jason: The SMS Client causes our PCs to lock up. Is this just a client issue, or will site server settings have an effect? What can be checked to eliminate this?
Bruce: First, we would need to figure out which service, which Client Agent was causing the lockup. Quickly, I'll suggest that if you can reproduce this in a lab, start eliminating optional Client Agent components until you find what's causing the lockup, and then address that log. If it goes beyond that, I would strongly suggest you call PSS to do more specific troubleshooting. There's not enough information presented to know exactly what's locking up at this point.
Wally: One other point on that before Jason moves us on is that, I've seen cases and heard of a lot of cases where there are actually third-party products that are integrating with, and it's actually their product that's causing a lockup. I know some of them, I won't mention any names, but some of them have actually come out with patches to their product to integrate with SMS so that it eliminates the lockup problem.
So not only just like Bruce mentioned, you can look at the SMS Client components and see which one is causing a conflict. If it is ours, you also need to look at any other the third-party products you have installed, access drivers or other type of things that might be installed for networks and so on that, in conjunction with those and SMS, causes a problem. In a number of cases I've seen and heard of through PSS, it's actually the third-party products. So you go up to their Web site, and they'll have a patch version or an update that you can apply that will go ahead and solve that and allow the interoperability between that third-party product and the SMS Client.
Bruce: One other thing I thought of also was that, if you have a client that has been imaged and the image contains a back-revved (earlier) version of some SMS Client files, and those Client files get confounded with current client files, they'll cause the components to be out of rev with each other. The way to clean that up is to, as of SP2, you can run 20cliclean /scrub and also search the client for Clicore.exe, make sure there's no version of Clicore.exe that somebody saved off to the system directory or something like that that is causing problems.
Jason: Okay. We are down to less than 10 minutes remaining as far as what I can expand the WebCast to run for. So I'm going to ask that we don't get any more questions. I won't be able to take them.
The next question, Are there any known issues with rebuilding logon points after the PDC logon point has been rebuilt? We're getting a delay of two days. How can we avoid this?
Bruce: You're getting a delay of two days rebuilding the logon point after the initial ...
Jason: After the PDC logon point has been rebuilt.
Bruce: I'm going to refer to PSS, other than one thing. If you have set your schedule, and I don't remember where it is in the interface, but if you have set your schedule for your logon points to an impossibly short amount of time, what will happen is that you will begin to reconstruct your logon points, and you will run out of time before the next scheduled logon point reconstruction occurs. So you'll only get the first group of servers getting re-created, until eventually you'll work your way through the whole list. Now, customers sometimes get impatient thinking, if I just turn up the frequency, I'll get my logon points built faster, and that's exactly the opposite of what is true. You need to turn down your frequency to let that entire logon point creation cycle go through before it starts over again. You can look at the NT Logon log and the Site Component Manager log to see the activities and to see whether or not the process is starting over again before it has a chance to finish the first time.
Wally: Alternately, if your schedule's set to either the Windows Networking Logon Discovery or the Windows Networking Client Installation properties on the Logon Setting tab, that's where Bruce is looking for it, if that is set too long, because if you already have the logon points set up and now you rebuild your PDC, SMS doesn't know that, and we're not going to go check that until the next time we're scheduled to do a logon point update. So it may be that you have to wait until that period of time.
The other point is to make sure you're on at least Service Pack 2 if not greater, because in Service Pack 1 and RTM version, that process was single-threaded. So, as Bruce mentioned, you could time out and not be able to get to all your other logon points, whereas in Service Pack 2 and 3, it defaults to five threads, and that can be customized to give you more threads to rebuild your logon points in a more timely fashion.
Bruce: Two more tricks there, to start an immediate installation of the logon points. There's a utility called Kicklsm (Kicklsm.exe) that's in the latest version of the support tools up on the Web site. There's also, if you change the schedule in any way, shape, or form, it will kick off the process. So let's say you have it scheduled for once a day. You can change that schedule to any other value, and it should kick it off immediately.
Jason: Should RAID be used instead of a single disk for SMS SQL in a small site?
Bruce: If you're not have a performance problem, I would say you don't need to do that. The SMS Resource Guide and that white paper on sizing SMS site servers are pretty good about telling you what kind of disk subsystem you need for each kind of site. I'll just refer you to those resources because they tell you, if you have this many clients inventorying this frequently, you should have this kind of disk subsystem. So I'm going to refer you to those resources, but on a small site, generally no, it's not necessary depending on what you mean by small site again too.
Jason: Is there a single Q article that explains the functions and different SMS file extensions?
Bruce: No, that's in the appendix of, it's either the administrator's guide or the resource guide, and it's on one of the slides.
Jason: Okay. If we move services off the site server and onto other servers, are there easy ways to check if poor communication is contributing to a problem?
Bruce: The Status System will tell you the health of the communication between the site server and the remote site system. If you suspect network problems are causing delays and so on in communication between the site systems, and you need to get more granular detail, there is a resource kit tool for turning on NAL Logging and turning off NAL Logging. (NAL refers the network access layer.) What that will do is add into your logs, logging entries regarding the network connection attempts, the accounts used, I think it has the net APIs used on occasion, and any errors that would be returned.
So first, check the Status System. If the Status System doesn't give you sufficient information, check the log. If the log doesn't give you sufficient information, turn on NAL Logging. If that still does not give you sufficient information, try manually connecting between the site systems using the SMS Service Account or the Site Connection Account. If that still doesn't give you the error message or code that you need, then you're going to need to do a trace or call PSS.
Wally: One comment on enabling NAL Logging is that, when you enable NAL Logging, it greatly increases the size of your log files in a very quick fashion. So, you want to be aware of the fact that you may be losing other valuable information that you want for other processing because of enabling NAL Logging, because it is going to roll those log files over very quickly. So, you want to turn it on, do your examination, and turn NAL Logging back off. Those are two registry scripts in the resource kit to enable NAL Logging and to disable it again.
Bruce: You can turn the size of that log up in the SMS Service Manager. Turn it back down when you're done.
Jason: Okay. So we did not get to the last question, and the question is, I don't have the time to learn SQL. Are there any prefabricated Web reports available for SMS 2.0 from Microsoft? We will follow up with that user offline after the broadcast. The answer to that question is included below.
Follow-up answer: There is a link to it from http://www.microsoft.com/smsmgmt/. From there, look for "Systems Management Server Web Reporting Tool," which takes you to http://www.microsoft.com/downloads/release.asp?ReleaseID=28039, where you can download the actual code.
I want to thank everyone for joining us, and we hope to see you in the near future. Thank you and good-bye.
|