
Monthly Archives: February 2012

10-289

I recently achieved something I never had before: I read a book of over 1,000 pages, 1,168 to be exact.

It was an amazing struggle of capitalism vs. socialism set in the 1950s. Atlas Shrugged by Ayn Rand is a very fictional, fear-mongering account of what life might be like if extreme socialistic views swept the United States. There was a poorly reviewed movie released last year that covered only the first part of the book. I would highly recommend this book to anyone with some free time.

10-289 is a directive issued by the socialist regime. By the second half of the book they have begun to take serious control and have their way with whomever they want. The leaders issue Directive 10-289, which basically locks everyone into their current jobs and allows no price or wage changes. Keep in mind this comes after the “no-dog-eat-dog” directive that governs big business.

Picture an airplane with 4 engines, 3 of which are currently on fire. All the pilot is allowed to do is stay the course until the plane eventually crashes, because someone safely on the ground told them that’s what they are supposed to do. If the pilot or engineers try to fix anything they will be thrown from the plane to plummet to their certain death. There comes a time in this book where the engineers not only lose their ability to argue valid points but don’t even question the letter of the law.

Back to Tech

The reason I told you this is that it relates so amazingly well to my work. No one wants production problems, but sometimes they happen. Whether it be human error, poor judgement or simply computer neglect, the consequences can add up quickly. DBAs are hit especially hard by the black cloud of doom that is change.

http://dilbert.com/strips/comic/2009-11-01/

How do DBAs turn into this waste of space? Well, it’s things like 10-289.

To put it as generically as possible, this recent outage we had at my shop was epic. At least two decades have gone by without setbacks as severe as the ones we’ve had in the last week. That caused management to enter a state of paranoia and fear of the unknown. Management tiptoed into the abyss and started, at first, whispering the directives. But then they got louder and louder until the goal was set: no changes.

Two fundamental skills that make DBAs good are an effort to learn and judgement. Without the ability to make decisions the learning is useless, and without learning the judgements are terrible. So when the socialist regime issued its directive of no changes, we were trapped in a loop of learning without purpose. The desire to learn simmered on a stove without power to keep it going.

With experience comes a growing knowledge of internals and the interlinking of production systems. This knowledge helps us DBAs avoid peak-hour changes that could cause problems. This knowledge helps us understand what is wrong so we can quickly defuse problems and put out that engine fire. This knowledge is invaluable.

So knowledge is the driver for good decisions. The responsibility of decision making drives us to learn. It’s a real A=A logical argument. Good knowledge = Good decisions. The moment all decisions get wrapped in an approval process, the will to learn goes by the wayside. So I say to you, Power is Knowledge.

I present this argument against all that is stifling your power. Don’t let blanket regulations get you down. Go forth and use your decision-making abilities. Listen to your management but remember Sturgeon’s Law is always at play.


Posted by on February 28, 2012 in Network Admin, SQL Admin

 

Server Naming Convention(s)

I learned a new word in a recent and quite heated discussion about server naming conventions. That word is camel casing. I knew of the concept and actually prefer CamelCasingForServerNamingConventions.

Camels aside, I learned quite a few things about the history of our servers from a discussion we server admins had at my shop. What it boiled down to is that there are only a couple of guidelines we should and do follow, and the rest of the blanket regulations don’t have much benefit.

One of the comments that I heard was: why bother, because our naming conventions only last 6 months. This is terribly true, historically speaking.

To get to the point where everyone can agree, or at least agree to disagree, we need to have a good conversation. It would be easy to come up with a naming convention and force it upon the lesser, non-decision-making folks. This method, regardless of your experience, will not provide the best naming convention. So I suggest getting the opinions of everyone involved.

Government bodies often debate agenda items that seem like a huge waste of time. That is just how government works. I agree with the average person who says government is inefficient. But some of those inefficiencies generate long-standing rules and regulations. That way they are not reworking something as simple as a naming convention every 6 months. Technology changes faster than social norms so the comparison is a stretch. However, a naming convention is a healthy team-building (or breaking) exercise to attack.

Now, first thing on the agenda for a naming convention for servers is who to invite. Devs? NO, at least not yet, maybe later… DBAs? Well maybe… or maybe not. Server Admins? Yes, these folks should be the first people involved in discussing the server naming convention. To be more specific, I suggest a verbal brainstorming effort to identify all considerations. Brainstorming does not include negativity. One interesting thought is the difference between verbal and written communication. In my shop, all of our written communication mediums are tied directly to our names. If we had a medium that was not directly tied to our names, brainstorming in writing could be more effective. So for now, verbal, less-documented communication generates more ideas than a written discussion, because when most people are asked to write down their thoughts and they know they will be accountable, they are less likely to submit bad or mediocre ideas.

So go through the brainstorming phase which will probably produce something like this:

1. 15 chars or less, because of stupid apps that force NetBIOS name resolution
2. put environment (prod or test) in the name
3. 8 chars or less because the mainframe can’t count very high
4. 3 character environment names PRD, TST, SND, DEV, QAS
5. Environment names should go first
6. 1 character environment names because that leaves 7 chars for mainframe apps
7. fixed length names
8. should always end in a number (1, 2, 3…)
9. last three characters should be a number (001, 002)
10. ALLCAPSFORSVRNMS
11. CamelCaseForSrvNms
12. Ambiguity for security (names of stars, xmen characters)
13. The number at the end should count up for every server.
14. The number at the end should start at 1 for each environment
15. Database servers should have SQL in the name
16. servers should have their DR tier in the name
17. web servers should have “web” or “iis” in the name
18. non-web application servers should have “app” in the name
19. servers should never have “srv” “svr” “server” in the name because it’s redundant
20. servers part of a cluster should have “c” in the name
21. servers should have the building they are in in the name
22. servers should have the floor that they are on in the name
23. physical servers should have a “p” in the name
24. virtual servers should have “vm” in the name
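
Once a few of these survive the later rounds, they are easy to check mechanically instead of by eyeball. Here is a minimal sketch in Python, assuming a made-up convention that combines items 1, 4, 5, 9 and 10: a three-character environment prefix, a short role code, a three-digit number, all caps, 15 characters max. The function name, the role-code length and the pattern are my own inventions for illustration, not what any shop actually settled on.

```python
import re

# Hypothetical convention, for illustration only: <ENV><ROLE><NNN>, e.g. PRDSQL001
ENVIRONMENTS = ("PRD", "TST", "SND", "DEV", "QAS")  # item 4
MAX_LENGTH = 15                                     # item 1: NetBIOS limit

NAME_PATTERN = re.compile(
    r"^(?P<env>" + "|".join(ENVIRONMENTS) + r")"  # item 5: environment first
    r"(?P<role>[A-Z]{2,9})"                       # assumed role code, e.g. SQL, WEB, APP
    r"(?P<number>\d{3})$"                         # item 9: three-digit number last
)


def check_server_name(name):
    """Return a list of problems with the name; an empty list means it passes."""
    problems = []
    if len(name) > MAX_LENGTH:
        problems.append("longer than %d characters" % MAX_LENGTH)
    if name != name.upper():
        problems.append("not all caps")  # item 10
    # Match against the upper-cased name so a casing problem is reported only once.
    if NAME_PATTERN.match(name.upper()) is None:
        problems.append("does not match <ENV><ROLE><NNN>")
    return problems
```

Running something like this over the existing server inventory before the vote is a cheap way to see how many current names each proposed rule would break.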

So this could drone on and on for a long time but I think I have covered the basics. Once your team has all had their say, it’s time to start grouping contradictory items. At this point it would be a good idea to bring in some outside viewpoints from development and the business users who may have to type in the server name. (ahem, DNS pointers) One good point is that developers have this naming convention conversation about their code all the time. Sometimes it’s even more important for them to have a good naming convention.

Then it’s time to start voting like a good democracy. Each item should be allowed a yes or no vote to gather statistics. Also, whether you say yes or no, include the reasoning behind each vote. If you get any items with 20 nos and 1 yes you can throw those out. Then reconvene to have a discussion about the results and to get closer to a naming convention. Make sure to have some time in between meetings so that bad ideas can work themselves out and good ideas can iron themselves in.
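
If the ballots end up in a spreadsheet or a text file, the tally is a few lines of code. This is only a sketch of the voting step, assuming each ballot is an (item, vote, reason) tuple. The 10% cut-off is my own stand-in for "20 nos and 1 yes", so pick whatever threshold your team can live with.

```python
from collections import Counter


def tally_votes(ballots, min_yes_share=0.1):
    """Split brainstormed items into kept and dropped.

    ballots: iterable of (item, vote, reason) where vote is "yes" or "no".
    Returns (kept, dropped), each a dict of item -> Counter of votes.
    min_yes_share is an assumed cut-off, not a rule from the meeting.
    """
    counts = {}
    for item, vote, _reason in ballots:
        if vote not in ("yes", "no"):
            raise ValueError("vote must be 'yes' or 'no', got %r" % (vote,))
        counts.setdefault(item, Counter())[vote] += 1

    kept, dropped = {}, {}
    for item, votes in counts.items():
        yes_share = votes["yes"] / (votes["yes"] + votes["no"])
        if yes_share >= min_yes_share:
            kept[item] = votes
        else:
            dropped[item] = votes
    return kept, dropped
```

The reasons are carried along on each ballot but not counted. They are for the discussion in the next meeting, not for the arithmetic.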

In this follow-up meeting, be considerate to others and start to trim down your list of requirements. Some people may get upset that things aren’t going their way and run to their blog and create a large post. These people should be thankful for their opportunity to participate in the discussion.

Next, on the hot-button issues, do some pros/cons analysis. I think that the ambiguity option will quickly be cut because although it does have a solid pro, there are far too many costs associated with that option. Take a practical or even imaginary application environment and test out your newly designed convention. Consider recent developments in technology and future growth that may affect your naming convention. Take into account programming constraints such as “-” in SQL Server names, which then require [ ].
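
To make that last constraint concrete, here is a small sketch of what a hyphen costs you whenever a script builds a four-part name. It wraps each part in square brackets and doubles any closing bracket inside, which is the same thing T-SQL’s QUOTENAME does by default. The helper names and the example server, database and table names are made up.

```python
def quote_identifier(name):
    """Wrap a SQL Server identifier in [ ], doubling any ] inside it."""
    return "[" + name.replace("]", "]]") + "]"


def four_part_name(server, database, schema, table):
    """Build server.database.schema.table with every part bracketed."""
    parts = (server, database, schema, table)
    return ".".join(quote_identifier(part) for part in parts)


# A hyphenated server name only works in a query once it is bracketed:
print(four_part_name("PRD-SQL-001", "Sales", "dbo", "Orders"))
```

A name like PRDSQL001 works with or without the brackets, which is one less thing for every developer to remember.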

Finally, when you make a decision, stay the course. Unless it’s a bad course :]

 

Posted by on February 13, 2012 in Network Admin, SQL Admin

 

Baseline: something to do while things are working

This is a task that is hard to muster the effort to complete. I am talking about building a valuable baseline. A baseline is a detailed picture of an application environment when it is working well and has a normal user load. Without users everything works just great, so the user load is a key aspect of a good baseline. A baseline can also be performance test results from a system, but that is only useful if you are allowed to re-run that test.

Don’t rely on user-reported data for a baseline. Users are all different but generally they cannot detect a 3x performance gain, let alone a 40% performance gain. These survey-type inaccuracies can be exacerbated by a slowly changing system.

When I picture a good SQL baseline, I picture the data I would collect if I were having a problem. The big 4 [CPU][MEM][DISK][NET] performance metrics are key. The SQL waits over a period of time, such as a business day, are essential. I would like to dive a bit deeper in this post. There are more complicated problems that a good baseline can assist with.
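
A note on the waits: SQL Server’s sys.dm_os_wait_stats view is cumulative, so "waits over a business day" really means two snapshots and a subtraction. Here is a minimal sketch of the subtraction in Python, assuming each snapshot has already been loaded into a dict of wait_type to wait_time_ms. How you collect and store the snapshots is up to you, and the function name is just mine.

```python
def wait_deltas(start, end):
    """Return {wait_type: milliseconds waited between two snapshots}, largest first.

    start, end: dicts of wait_type -> cumulative wait_time_ms.
    Wait types that did not grow are left out. That includes counters that
    went backwards, which usually means the stats were reset in between.
    """
    deltas = {}
    for wait_type, end_ms in end.items():
        delta = end_ms - start.get(wait_type, 0)
        if delta > 0:
            deltas[wait_type] = delta
    ordered = sorted(deltas.items(), key=lambda pair: pair[1], reverse=True)
    return dict(ordered)


morning = {"PAGEIOLATCH_SH": 1000, "CXPACKET": 500, "WRITELOG": 200}
evening = {"PAGEIOLATCH_SH": 1500, "CXPACKET": 4500, "WRITELOG": 200, "ASYNC_NETWORK_IO": 50}
for wait_type, ms in wait_deltas(morning, evening).items():
    print(wait_type, ms)
```

Save one of these per normal business day and you have the waits portion of a baseline.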

You aren’t going to go from 0 to fixed in 3 seconds flat with a good baseline. It isn’t that easy. What you can do is look at a broken system, look at the baseline and then pick out the differences that should be the focus of your attention.
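
That "look at both and spot the differences" step can be roughed out in code. This is a sketch only, assuming the baseline and the current readings are flat dicts of metric name to number. The metric names, the function name and the 25% tolerance are all invented for the example.

```python
def compare_to_baseline(baseline, current, tolerance=0.25):
    """List the metrics that moved more than `tolerance` relative to baseline.

    Returns (metric, baseline_value, current_value, relative_change) tuples,
    biggest movers first. Metrics missing from `current`, or with a zero
    baseline (no meaningful ratio), are skipped.
    """
    findings = []
    for metric, base in baseline.items():
        if metric not in current or base == 0:
            continue
        change = (current[metric] - base) / base
        if abs(change) > tolerance:
            findings.append((metric, base, current[metric], change))
    findings.sort(key=lambda finding: abs(finding[3]), reverse=True)
    return findings


baseline = {"cpu_pct": 40.0, "disk_latency_ms": 5.0, "memory_gb": 48.0}
broken = {"cpu_pct": 44.0, "disk_latency_ms": 20.0, "memory_gb": 30.0}
for metric, base, now, change in compare_to_baseline(baseline, broken):
    print("%s: %s -> %s (%+.0f%%)" % (metric, base, now, change * 100))
```

It won’t tell you why disk latency quadrupled, but it tells you where to start looking, which is the whole point of having the baseline.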

This baseline vs. broken analysis should come after you have identified what changed. That is a very annoying question to the people who make changes. People who make changes (SAN admins, DBAs, VMware admins) dislike that question because they hear it a lot. It’s not so much a dislike of the question, it’s a dislike of the competence of the person asking it. It’s a completely valid question, but when you have a laundry list of changes… it can very much complicate and drag out the problem solving. Most changes can be [ctrl+z]’d but we have to understand why this change happened. If we smash [ctrl+z] in a panic we might be in the same boat a month from now. If we undo the change without figuring out what is wrong, it could effectively issue a DNR order on that product. It creates a fear that change is bad. Before you ask “what changed?” remember that time itself changes.

I personally like to sit in the “what changed” camp for a while, but if that doesn’t fix the problem you have to switch the question back to what’s wrong. A good baseline can help answer both of those questions. Like I mentioned, a good baseline will have pictures of all the areas you go to when troubleshooting. Even a simple test like ping can save you time. It will prevent the “OMG I can’t ping it, call the network team” response when actually Windows Server 2008 R2 blocks inbound ping (ICMP echo request) by default. Also, under network pressure, Windows will choose ICMP to drop first.

To get more specific, here are some things that I dream of in my baseline:

1. pathping results during peak usage from several areas
2. network trace
3. sql trace
4. graphs of ready time vs. cpu utilization
5. memory usage and allocation
6. disk latency and throughput
7. windows event log
8. application log
9. host log
10. sql server and db configuration
11. Full backup times

You think that’s a lot of data? Well that’s what it takes to solve complex problems. It takes a lot of work to really solve problems by figuring out the root cause. The real root cause, not just lupus. Not just the fact that you changed something, but the reason the change didn’t work.

A personal story

I’d like to close with a story of home ownership and stupidity. I’ve lived in the same house for over 4 years now and from time to time there has been a spot where you can stand and hear a vibration sound. When no one is standing there the vibration is gone. This makes it rather hard to troubleshoot but I assumed (correctly) that it was a vent making contact with the floorboard.

Several times this would irritate me beyond a reasonable level and I would march downstairs and bang at the nails holding the vents in place until I thought it went away. I would then march back upstairs in Homer Simpson fashion and step on the same spot and hear the vibration again, driving me insane.

The furnace and the furnace fan are very old so I, without a proper baseline, thought this vibration was normal. I must have replaced the furnace filter 10 times and decided that maybe this time I would buy a better filter. Still, with a new filter the vibration continued. Within the last month I started noticing that my dog would wake up hungry and pace around the bedroom, causing the vibration intermittently. This compounded frustrations because my dog should not know that it is 5:30am on the dot and should not be pacing in circles like some sort of zoo animal.

I marched downstairs determined to figure out how to replace this furnace fan that was making this terrible vibration. thumpthumpwumpthumpthumpthumpwumpthumpthumpwump…. gah! I walked around and felt the vents because the furnace was running at the time. There was a cold side and a hot side right next to the furnace. I opened up the furnace and removed the filter and looked at it. I thought… hrm… that’s odd, the dust is on the wrong side! DOH! I put the new, new furnace filter in with the nicely labeled [AIR-->FLOW] arrow pointing the other way. It makes sense now that the cold air goes in and the hot air comes out…duh.

When my wife got home later I bragged to her about my victory. She got mad. She explained that she had said something about that already, which I didn’t think I remembered. I posed a fairly good argument that if she reeeeealy thought that was it, why didn’t she change it. So she stayed mad for a day and then the next day told me the story of exactly what I said about 6 months ago, when we were talking while I was putting the first new filter in.

The furnace fan was running and you were replacing the filter. Dustin, I said to you, “Are you sure that’s right?” And you replied, “O yeeeeea, I stuck my hand in there and felt the air”.

This caused an immediate rush of humility because I did remember saying that and it made me feel downright stupid. We laughed hysterically for a while. I admitted I was completely and totally wrong because:
A. I didn’t have a good baseline and blamed the fan that has been working for years
B. I “stuck my hand in there” while the thing was turned on thinking that was a good test for air flow in a vacuum
C. I was arrogant and discounted her when she asked “Are you sure?”

 

Posted by on February 3, 2012 in Network Admin, SQL Admin