
#sqlsat552 Lincoln, NE recap

Before you call me the crazy SQL guy (I think that name is already taken), I had several motives for driving 10 hours to Nebraska. In no particular order:

  • My wife and I work in different states, and Lincoln happens to be on the way.
  • We like nature, and this one was in a state park
  • There was a dog-friendly cabin available for 3 nights at a decent rate
  • SQL Saturdays are awesome
  • Free sessions and networking
  • Free volunteering (sounds weird but if you have done it, you understand)
  • I’m considering coordinating my own SQL Saturday

However, with every opportunity there is a cost:

  • All-in I probably spent $700 I would not have spent otherwise
  • I missed some football games (no big deal)
  • I missed opening day of hunting season (kind of a big deal)

Here are some of my notes from the sessions I attended

Getting Started with Extended Events – Andy Galbraith
Switching from a working solution to something new when you don’t necessarily have to can cause some anxiety. This anxiety causes paralysis. Sometimes we all need a little nudge to really get going.
I was a little confused by the statement that extended events fire synchronously yet are lightweight. It sounds like a bit of a contradiction I still have to sort out, but my understanding is that the events themselves fire synchronously while the targets can consume and record them asynchronously.
The key bits are packages, which contain events; actions, which are the extra info you track; and predicates, which filter like a WHERE clause.
The event file target is a .xel file and is one of the least impactful ways to save off the information.
System_health is like the default trace and is always running; don't modify it.
Use caution with what events you pick and where you store them; query plans can cause a lot of overhead, and storing to the memory buffer target can bloat memory.
Set state=start to make the XE work
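For reference, here is a minimal sketch of an Extended Events session built from those pieces. The session name, file path, and duration filter are my own placeholders, not anything from the session:

-- Events come from packages, actions add extra info, predicates filter like a WHERE clause
CREATE EVENT SESSION [LongRunningQueries] ON SERVER
ADD EVENT sqlserver.sql_statement_completed
(
    ACTION (sqlserver.sql_text, sqlserver.database_name)   -- info to track
    WHERE (duration > 5000000)                              -- predicate: longer than 5 seconds (microseconds)
)
ADD TARGET package0.event_file (SET filename = N'C:\XE\LongRunningQueries.xel');  -- least impactful target
GO
ALTER EVENT SESSION [LongRunningQueries] ON SERVER STATE = START;  -- state=start to make the XE work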

Get Your Optimizer to Give up All Its Secrets – Brian Hansen
Brian had a nice, organized flow that made me compare his style to that of a college professor.
Logical query processing order is important and starts with FROM and ON, not SELECT.
Intellisense doesn’t work until you have typed the FROM clause.
He pointed out some confusion with "select" meaning something different in relational algebra (there it is a filter); a projection in relational algebra is closer to SQL's SELECT list.
The query tree is the output of the algebrizer, and he had a nifty app to visualize query trees.
Distinct is like a group by
Constant scans are usually good because they can help avoid data access – ex: a where clause with A > 50 and A < 50
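That contradiction example is easy to try yourself; with a simple impossible predicate the plan shows a Constant Scan instead of touching the table (table and column names here are made up):

-- The optimizer detects that no row can satisfy both conditions,
-- so the plan is a Constant Scan with no data access at all
SELECT OrderID, OrderAmount
FROM dbo.Orders
WHERE OrderAmount > 50
  AND OrderAmount < 50;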

Tips and Tricks for Performance Troubleshooting – Phillip Podhradsky
Phillip had a faster pace and covered a lot of ground for an intro session.
Query plans are key, along with statistics. A statistics histogram tops out at 200 steps, so you might not get a step for every value in a “colors” table. Creating additional filtered stats can help with this issue.
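A hedged sketch of that filtered-statistics idea, using a made-up table: when one histogram can't cover every value, a second statistic over a slice of the data gives the optimizer better numbers for that slice.

-- The default histogram maxes out at 200 steps, so skewed or crowded-out values
-- can get bad estimates; a filtered statistic covers part of the range explicitly
CREATE STATISTICS stats_Items_Color_MtoZ
ON dbo.Items (Color)
WHERE Color >= 'M';   -- a second histogram just for the back half of the values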

Query Plan Deep Dive – Kevin Boles
Kevin is a great speaker and had a well-organized slide deck.
Ref: Paul White – cross apply blog post
Learn to spot nested loop joins that can be inefficient when a large data pipe is on top (the outer input)
The optimizer will sometimes choose to sort the data to do some joins
Merge join works well if there is order to both tables
Hash joins – the table is built in memory and will spill to tempdb, works well if not ordered
Stream aggregates require order (non-blocking)
Fast first – can cause more overall work, but creates a better user experience for a poorly tuned query
Asserts- check things like True or False conditions (ex: finding a key to delete)
Pick fill factor intelligently
Tsqlt.org
ADO.net timeout is 60sec ish
Optimize for unknown or recompile are options if you run into parameter sniffing issues (see the sketch after these notes)
UDFs will generate warnings in the query plans and cause some bad estimates
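On the sniffing note above, a rough sketch of the two hints mentioned; the procedure, table, and parameter names are mine, not Kevin's:

-- Illustrative only: assumes a dbo.Orders table exists
CREATE PROCEDURE dbo.GetOrdersForCustomer
    @CustomerID int
AS
BEGIN
    -- Plan is built for an "average" value instead of whatever parameter was sniffed first
    SELECT OrderID, OrderDate
    FROM dbo.Orders
    WHERE CustomerID = @CustomerID
    OPTION (OPTIMIZE FOR UNKNOWN);

    -- Alternative: OPTION (RECOMPILE) builds a fresh plan for the current value,
    -- at the cost of compiling on every execution
END;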

Dammit Jim! Dr McCoy’s Field Guide to system_health (and the default trace) – Ed Leighton-Dick
Finding answers to “who done it” with the things that record data by default
System_health is an XE session and the default trace is a Profiler-style trace – they both get trimmed (rolled over) after a while
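If you want to poke at either one yourself, something along these lines works; the wildcard path assumes the files are still in the default LOG directory:

-- Read the system_health Extended Events file target
SELECT CAST(event_data AS xml) AS event_data
FROM sys.fn_xe_file_target_read_file('system_health*.xel', NULL, NULL, NULL);

-- Find the default trace file so it can be opened with fn_trace_gettable
SELECT id, path, start_time
FROM sys.traces
WHERE is_default = 1;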

Afterwards we met up at one of the Peter Kiewit Lodge cabins for some drinks and food. I talked to the organizer a bit and got some great information from volunteering and the after party to help me consider organizing my own SQLSaturday.
I had a great time and look forward to attending more SQLSaturday events next year.

 

 

Posted on December 3, 2016 in Uncategorized

 

PASS Summit 2016 Recap

Introduction

I have a fairly massive bundle of lanyards in my drawer at work. vmworld, ciscolive, emcworld, sql connections, sql rally, grr con, bsides detroit, a bunch of sql saturdays and more. All are great learning experiences and if I didn’t have to do actual work, I’d love to attend more conferences. I definitely plan to attend the PASS Summit next year and many years to come.

All employers should commit to a couple weeks of training for all associates. Most new people can learn from the experience of more senior members of their team but both the trainer and trainee need to be available for this to work. Classroom training from companies like New Horizons can come in handy for those associates who don’t like to travel or for companies that don’t have a good mentor in that field. Conferences that require travel can sometimes cost 3-4k but with the right person, are totally worth it. In fact, sending multiple associates to the same conference gives them a chance for teambuilding somewhat outside the stuffy bounds of the office.

Being a PASS chapter leader grants me free admission to the 3 main conference days. That got my reported expenses down to 2k, with the majority of that being flight, hotel, and cabs. Chapter leader is a time-consuming job, but very rewarding in many ways, including this free ticket.

The Summit stands out from all other conferences I’ve been to for a couple reasons. First, the content is mostly not vendor content and more relevant to the “how” part of my job. Tips and tricks instead of products are explained in many sessions. We as attendees are mostly not getting sold a particular product because a majority of the speakers are community speakers. Sure, definitely lots of fans of SQL Server and not much mention of Oracle/MySQL/Postgresql/DB2/etc.. but I think you get the point. Second, it has a feeling of a family reunion for a lot of people. We have that crazy uncle, annoying younger brother, CEO cousin, and a poor bunch of Cleveland Indian fans. Networking is hard but the most beneficial aspect of the conference. There seems to be a greater number of people willing to meet new people at PASS than any other conference I’ve been to. There are also many other PASS events that give you an opportunity to reconnect with these people.

Without further ado, here is my play-by-play.

Monday

Alaska Airlines (their 23” rule is the largest free carry-on allowance that I am aware of)

I would have liked to attend a pre-con this day, but I would have had to fly out Sunday.

Tuesday

Registration

Chapter Leader Meeting

I didn’t know this last year but this is a mandatory meeting for chapter leaders who get the free Summit ticket. Lots of good ideas discussed for my local chapter (http://glass.sqlpass.org) in regards to sponsorship, speakers, and locations.

Welcome Reception

A single drink ticket was provided and cash bar after that. Light food and very tasty. Great chance to network.

Wednesday

Keynote

Discussed HTAP, hybrid transactional/analytical processing. Basically, the reason we do ETL is to avoid putting too much stress on the primary system. Some of this can go away with multiple object types that store the same data in different ways. An In-Memory OLTP table with a Clustered Columnstore Index can serve both reports and transactions at the same time. This is interesting, but the reason CCSI was read-only in version 2014 was that updates to this type of structure are slow; updates can happen in version 2016, but I'm not sure how fast they are.
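My rough sketch of what that HTAP combination looks like in SQL Server 2016 (not the demo code; names are placeholders, and it assumes the database already has a memory-optimized filegroup):

-- In-Memory OLTP table for the transactional workload,
-- with a clustered columnstore index layered on for reporting over the same rows
CREATE TABLE dbo.Trades
(
    TradeID   bigint IDENTITY(1,1) NOT NULL PRIMARY KEY NONCLUSTERED,
    Symbol    char(4)        NOT NULL,
    Quantity  int            NOT NULL,
    Price     decimal(18,4)  NOT NULL,
    TradeTime datetime2(7)   NOT NULL,
    INDEX ccsi_Trades CLUSTERED COLUMNSTORE
)
WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);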

Microsoft R can be installed on SQL Server or on a standalone server. They have promised huge performance gains through multithreading and scale out options. Not sure if there is a licensing benefit for us to go standalone, built into SQL, or just keep using open source R.

Azure IO bandwidth is 4x any of the other cloud competitors. SQL Server running on Linux was demoed, and it even works with AlwaysOn availability groups. It isn't that appealing to me and won't be GA until later in 2017.

Architecting SQL Server for Mission Critical and High Performance (HPE Vendor Session)

Discussed some superdome hardware architecture. One interesting trend is using NVDIMMs. This non-volatile memory can be used to harden log records. DAX bypasses the storage stack, but can still be configured as a hard drive in windows. SQL can take advantage of this for a secondary small log file with a trace flag and eliminate LOGWRITE waits. The size (8GB) and availability of these devices is prohibitive.

Inside SQL Server In-Memory OLTP

Bob Ward has a fast paced, in-depth speaking style that I like. I attended a couple of his sessions and he made his slides available at http://aka.ms/bobwardms

The goal of In-Memory OLTP was to be 100x faster than previous versions. It sounds like they got about 30x faster, mostly due to the removal of latching. The important durability differences between an In-Memory table and a regular engine table were discussed. No UNDO process is required for durability because there is no concept of write-ahead logging; these transactions are just dropped on failure. 701 is the error we will get if we don't size memory properly for the table we want to enable for In-Memory functionality. There are management objects, but they don't show up in the database system catalog since they are in-memory tables, and recording that information in regular tables would slow transactions down. There are two main types of indexes, hash and range. Bucket sizes need to be determined for hash indexes, but they are much faster if configured properly. XTP_OFFLINE_CKPT is a wait from a thread that reads the transaction log and decides what to flush to disk. This isn't a wait that affects transactions; it runs asynchronously.
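A minimal sketch of the hash-versus-range index point; the bucket count and names are illustrative (undersized buckets lead to long chains and slow lookups), and it assumes a MEMORY_OPTIMIZED_DATA filegroup exists:

CREATE TABLE dbo.SessionState
(
    SessionID   int       NOT NULL
        PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 1000000),  -- size near the distinct key count
    UserID      int       NOT NULL,
    LastTouched datetime2 NOT NULL,
    INDEX ix_LastTouched NONCLUSTERED (LastTouched)   -- range index: no bucket sizing needed
)
WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);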

Lightning Talks

Vendor Reception

Nimble Storage and Pure Storage were there, not sure if any other storage vendor was. Talked to AWS about some of the things they do. Providing a database as a service separate from a database server as a service might be worth us looking into. VMware was there and working the booth was one of the speakers from EMCworld and CiscoLive. I got some information about some of the max VM sizes in the wild and found out it might be worth trying to put Actuarial SQL Servers as a single VM on a host.

Thursday

Keynote

It was refreshing to see several options in cloud data warehousing presented. Dr. David DeWitt compared a roll-your-own DW with Amazon Redshift, Microsoft's Azure SQL DW, and Snowflake. He also talked about Oracle RAC and Teradata. I liked the comparison of different vendors with a fairly open-minded approach at a conference that focuses on one vendor. He explained some of the pros and cons of any scale-out data platform.

SQL 2016: It just runs faster

Clock speed hasn’t improved much in 10 years but the number of cores per processor has gotten really high. Several pieces of the SQL engine needed to be re-written to process more in parallel. Parallel redo, parallel insert select, automatic scaling of number of threads and more partitioning allows for more parallel operations in 2016. We need to work on our ISVs to support later versions of SQL Server if we want to improve overall user experience.

Recovery Models Made Simple

Great session that explained recovery and a lot about the SQL Server log file. How active transactions affect space used in the log file was explained very well. Virtual log files and their reuse were also covered. Fn_dblog(null,null) can be used to explore the information stored in the log file.
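A quick example of poking at the log the way the session described; fn_dblog is undocumented, so treat this as a curiosity rather than a supported tool:

-- Peek at the active portion of the current database's log
SELECT TOP (100)
       [Current LSN], Operation, Context, [Transaction ID], AllocUnitName
FROM fn_dblog(NULL, NULL);

-- Virtual log files and their reuse
DBCC LOGINFO;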

Monitoring and Alerting of Availability Groups

Picked up a lot of useful tips and tricks for the couple AGs we have in production. The SSMS dashboard is pretty useful for monitoring performance. Reviewed the basic steps of an AG transaction. Overall makes me more comfortable with 2016 AGs even though we try to avoid them since it is a copy of the data and a windows cluster on top of a vmware cluster.

Vendor Session: Storage Testing

Learned about a fairly new tool, Vdbench, which can simulate more realistic virtual workloads. With dedupe and compression inline, synthetic tools do not provide a whole lot of useful information. Argenis and David are awesome speakers.

Microsoft R Sessions

I attended a SQLCAT session and an introduction session to try to get a handle on how best to manage R for Actuarial. Open source R doesn't operate in parallel, and Microsoft has created ScaleR functions (rx*) that are comparable but operate in parallel. I saw an interesting demo of predicting the date that a database is going to run out of space with R. That might be a good way to practice with the R that is built into SQL Server. There wasn't much around memory management with built-in R, other than learning it can spawn a lot of extra processes outside of the SQL process.

 

Posted on November 21, 2016 in Uncategorized

 

That Time I Made SQL Server Too Fast

Things break. We don't usually intend them to, they just do. Sometimes, maybe we do intend them to break. I'm still suspicious about cars breaking on purpose so we will buy a new one. Other times, they break in ways we don't expect, like these times I made queries respond too fast.

For example, several years ago I was a fairly new DBA. I wasn't used to the enterprise-class roasting one gets when things are not quite right. We had a multi-tier application (database, app server, and a desktop client) that started to respond with errors after a hardware upgrade of the SQL Server.

Darn. We tested everything we could. We reviewed the code and sent it through upgrade analyzer. We restored production backups and tested those on the new hardware.

But it wasn’t enough load for the system to break in a peculiar way. The server had a limited number of threads to deal with a response from SQL. Seriously, hard-coded value of 3 thread to go to SQL, get a visual basic script stored in the database, and execute that on the app server. I kid you not, sometimes the sausage factory isn’t worth the tour.

This dynamic script was now loading up these 3 threads all day, and it was a lengthy script, so sometimes it would hit the other hard-coded timeout values and the users would start to get errors in large numbers instead of just waiting a little longer than usual.

We tried, and tried some more to work with the vendor. But in the end I had to make SQL slower so the app could keep up. We moved this SQL Server back to old hardware until we could also upgrade the app server.

The most recent time I made SQL too fast was a couple days ago. Another third party vendor who is XML happy. Surprisingly, the problem had nothing to do with XML.

Rewind several years: I convinced a co-worker that SQL 2012 would be a great choice for their application even though the vendor didn't technically support it yet. Who cares! Microsoft supports 2012, not this vendor. We tried and tried to get the thing to work, but the application just wouldn't cooperate with SQL 2012. Fortunately, we didn't have any data, so we just reinstalled the app and used SQL 2008 R2. I conceded defeat, probably too soon, and rolled it back after delaying their upgrade a couple weeks.

Back to a few days ago, that same co-worker fell for my tricks again with a different app. This time SQL 2016, because Bob Ward says it just runs faster ( http://www.sqlpass.org/summit/2016/Sessions/Details.aspx?sid=51194 ). Let me tell you how hilarious this error is as briefly as I can.

This XML-happy application has a boatload of tables. Fairly normalized, but lots of XML chatter back and forth with the application even with just a single user. We were testing a new version and a single user was getting a primary key violation. It consistently happened on SQL 2016, so I had him send it back to the vendor, even though he was threatening physical violence. Given our prior experience of jumping on a new version, I don't blame him. No way could this be SQL 2016; primary keys are similar across all versions, right?

Vendor comes back with some confused email. "We can't recreate the issue." Blah blah, co-worker convinces me to give him access to create the database on a SQL 2012 instance and bam, no error. The beatings start. Even though he knows it's bogus, I still have bruises and my fear of the new shiny is reinforced.

Are you ready for the shocking news? SQL 2016 is fast. Really fast. Even with absolute garbage design and code, it's fast. That is how we got primary key violations.

The table we are working with has a primary key. But the interesting part is, it is a composite key meaning there are a few columns that create uniqueness between rows. So two of the columns can be identical across rows, as long as the third is different. Then we are unique.

Well, in an absolutely vomit-inducing way, the vendor chose a datetime datatype to create this uniqueness. When several rows come back in less than 3ms, even with a boatload of garbage XML to process, we no longer have uniqueness, because datetime isn't precise enough.
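You can see the rounding for yourself; datetime only stores increments of roughly 1/300 of a second, so values land on .000, .003, or .007:

-- Two "different" times collapse onto the same stored value
SELECT CAST('2016-11-21 12:00:00.000' AS datetime) AS t1,  -- stored as .000
       CAST('2016-11-21 12:00:00.001' AS datetime) AS t2;  -- also stored as .000
-- Several rows arriving inside the same ~3ms window get identical key values,
-- and the composite primary key violation follows.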

I'm not sure what they might consider a fix, but I explained to them what is happening. Hopefully they don't make the knee-jerk jump to datetime2 and instead go with an actually unique key.

 

 

 

Posted on November 21, 2016 in Uncategorized

 

vmworld 2016 keynote

I’ve been lucky enough to be able to attend vmworld once in 2013 but didn’t get to go this year. I decided to take advantage of the free live stream of the keynote on Monday.

Pat Gelsinger took the stage with 23,000 of “his closest friends” a.k.a vmworld attendees, and had a few interesting things to say. First, he no longer makes a distinction between digital and non-digital businesses. Truth is, I cannot think of a business that will grow into the next decade without embracing technology. They may not need VMware and the cloud but with the new generation seeking products and services exclusively online, every business will need a way to engage these new customers.

After that he shared some interesting data points:

– in 2006, only 2% of the cloud was considered public with most of that being salesforce.com
– in 2016, about 80% is still traditional IT, meaning the automated service deployment or future service based IT is still only 20%
– 2016 is the year that 50% of hardware is in some form of public cloud and shifting from private datacenters rapidly
– in 2016, 70% of workloads function in traditional IT, and by 2021 that number will be down to 50%

Servers are becoming more powerful, and consolidation is improving. Pat doesn’t see this as a shrinking market but actually generating growth by making the cost of servers more accessible to the masses. He compared this to processor enhancements of the 90s and I agree with this assessment.

Enterprises are finding cases where the public clouds make a lot of sense but there will (always?) be lots of cases where the existing owned datacenter will have advantages and certain services will be provided in-house.

The most interesting announcement from the keynote was VMware's cross-cloud platform. This was a higher-level GUI for managing services on all the big public clouds and on private clouds. There was a demo where you could enter the login information for AWS and this platform would discover the applications hosted under this account and present them in a uniform fashion with other cloud services.

This service is interesting, but it appears to only work with IBM's cloud, which advertised "500" customers. That doesn't seem like a lot of market penetration to me. If this cross-cloud platform doesn't work with all the cloud vendors, it won't make it very far in my opinion.

Another buzzword I picked up on was Automated Lifecycle Manager. This sounds like a product that has reached its end of life (https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2015661). There is an obvious need for this when the demand for services increases with the decrease in cost of cloud services. We have to automate the full lifecycle to prevent sprawl and complexity. I wonder if VMware is coming out with something new on this front and that is why it was mentioned.

The guest speakers from IBM, Marriott and Michael Dell all seemed to fall flat to me. None had any really innovative or inspiring comments. I'm not sure who the speaker was, but towards the end they started talking about service architectures. One comment I did like was choosing multiple cloud providers for a single service in case one cloud goes down. This has happened, and even though it could cause a lot of complexity, it could be something we would want to architect around.

When I attend conferences like this, I usually try to find something more valuable to do with my time during the general sessions. In conclusion, the session wasn't that bad. My view wasn't totally negative like this post (http://diginomica.com/2016/08/29/vmworld-2016s-keynote-snoozer-is-no-worry-for-the-infrastructure-innovators/), but since my opportunity cost was low, I don't regret tuning in.

 

Posted on August 30, 2016 in Virtual

 

MTC Chicago Visit

Microsoft has datacenters that are open to customers for training and consulting of all kinds. I’ve been to the newer one in Detroit and to the MTC Chicago twice now with the most recent time being this week. It was a good visit sponsored and encouraged by our storage vendor.

The key topic was ways to manage copies of data. We have some specific needs surrounding reporting, agile development, and quality assurance that require several copies of large databases. Thin provisioning, compression, and deduplication help control costs, but several arrays do best when the copy is a snapshot done at the block level. This reduces the cost of having that copy and keeping it around. These users, however, generally want to create the copy at the database level. This has a few complexities, even in a fairly standard virtual environment.

Some of these copies require the source system to be quiesced, so the snapshot technology needs to integrate with your database technology to pause writes while the snapshot completes. SQL handles crash consistency very well, but if some copies require quiescing, I may as well set up a system that does that for all of them. The virtual layer has abstraction layers between what the guest operating system sees and what the storage system presents. This also needs to be coordinated. All of these separate scripts that use different APIs need to be scheduled, and some need to be able to be called from another application or run on demand.

This problem has been mostly solved by some of the automation I have done surrounding backup and restores. We have a working solution but it takes technical time and more storage than it needs. Whenever you have a solution that is good enough, it prevents you from creating a great solution. It is a good solution up to a point until the number of copies, and time it takes to make a copy become prohibitive.

It was a great trip to see another datacenter in action that has the up and coming technologies. It gave me motivation to work on the great solution instead of settling for one that barely works.

 

Posted on July 29, 2016 in SQL Admin, Storage

 

EMCWorld 2016 Recap


This was my second time at EMC World and I enjoyed 2016 just as much as 2014. I ended up signing up for a test and am happy to report that I passed! Most big conferences like this offer a free or reduced-price attempt at one of their exams, and I chose the XtremIO Specialist for Storage Administrators. I prefer taking the exam first thing Monday. Sure, there is a chance I could learn something during the week that might be on the exam, but I think it is more valuable to be well rested and not have my mind cluttered with all the new knowledge. Once that was done I had time to get into the keynote.

Opening Keynote

Seeing Joe Tucci on stage for possibly the last time was a bit like seeing John Chambers at Cisco Live the previous year. Although the circumstances were different, both crowds seemed to respond the same way to seeing their respective leaders pass the torch. Michael Dell took the stage and had a few interesting things to say:

-Dell Technologies will be the new company name
-Dell EMC will be the enterprise name
-EMC is the best incubator of new technology
-Dell has the best global supply chain
-Both companies combine for 21 products in the Gartner Magic Quadrant

There were also some product announcements at EMC World. Unity is a mid-tier array with an all-flash version for $20K. DSSD D5 had no pricing mentioned, because if you have to ask, it is too expensive. This product addresses some of the IO stack issues and works with new “block drivers” and “Direct Memory APIs” to reduce latency [1]. If 10 million IOPS isn't enough, cluster ability is coming soon. ScaleIO, Virtustream Storage Cloud, enterprise copy data management (eCDM), and the Virtual Edition of Data Domain were also announced.

RecoverPoint

When setting up my schedule, I made sure to get all the interesting-looking RecoverPoint sessions booked. Gen6 hardware is out, so it is a product that has been around for a while… or has it? EMC didn't make it easy for us when choosing a product name for RecoverPoint for VMs (RPVM). RPA, or RecoverPoint Appliance, is separate from RPVM. RPVM uses an IO splitter within ESXi in order to provide a potential replacement for VMware's Site Recovery Manager. I took the hands-on lab for RPVM and found it to be pretty complex. It is nice to be able to pick and choose which VMs I can protect, but sometimes I want to choose larger groups to reduce the maintenance. Maybe this is possible, but it wasn't very clear to me. My suspicion is array-based replication will still be more efficient than host-based replication options such as RPVM or vSphere Replication.

RPA has a very interesting development along the lines of DR. Since hearing about the XtremIO, I questioned how the writes would be able to replicate fast enough to achieve a decent RPO. RPA can now utilize the XtremIO snapshots in a continuous manner, diff them, and send only the unique blocks over the WAN. That makes things very efficient compared to other methods. Also, the target array will have the volumes that we can make accessible for testing using XtremIO virtual copies (more snapshots).

DataDomain, DDBoost and ProtectPoint

DataDomain's virtual appliance announcement was interesting, but I'm not sure I have a specific use case yet. Mainly the need to back up a branch office might come into play, but I would want a separate array to host that vmdk. ProtectPoint has volume-level recovery features and SQL Server integration now. I can choose to back up a database whose log and data files are on the same volume and then use the SSMS plugin to do a volume-level restore. This grabs the bits from DataDomain and overlays them onto the XtremIO using your storage network. I'm not sure how efficient this restore is since I just did the hands-on lab, but it is very appealing for our very large databases that tend to choke the IP stack when backing up.

DDBoost v3 is coming out in June. This release includes things like copy only backups, restore with verify only, AAG support, restore with recovery for log tails, and also restore compression. I know many DBAs have had a bad experience with DDBoost so far. I have avoided it but v3 might be worth a try.

Integrated Copy Data Management and AppSync

If you have two volumes on XtremIO and load them up with identical data one right after the other, you will not see a perfect reduction of data. The inline deduplication rate (ballpark 2:1 – 4:1) will kick in and save you some space, but not a lot. If you can implement a solution where you present the volume of pre-loaded data to another host, XVC (writable copies) will save a ton of space. In one session they surveyed several large companies, and they had roughly 8-12 copies of their main production database. Consider that being a 1TB database with 2:1 data reduction. That is .5TB of used physical capacity plus the change rate between refreshes. Now in a traditional array like VMAX (no compression yet), that is up to 13TB used.

I think one of the goals of the AppSync software is to put the CDM tasks into the hands of the application owner. The storage administrator can setup the runbooks and then create and grant access to a button to do all the necessary steps for a refresh. It sounds like support for Windows clustered drives is in the works with other features being added soon as well.

Deep Dive Sessions

I attended a session titled DR with NSX and SRM. The speakers proposed a virtualized network solution that decouples DR from the physical network. No more L2 extension technology required. The cross vCenter NSX used Locale ID tags for each site to create local routes. The architecture even had some solutions for public website natting to the proper location. I hope the slides get posted because it was pretty deep for me to take in lecture form. The one thing I found fairly comical was the speaker mentioning OTV being expensive as a reason to look at NSX… maybe they have never seen a price for NSX.

The VMware validated reference design was a very good session for me. It validated a lot of our decisions and also got me thinking about a couple new tweaks. HW v11 can now scale to 128 cores and 4TB of RAM for a single VM. VMs are getting to 99.3% efficiency versus their hardware counterparts. Some Hadoop architectures even perform better in a virtual environment. My notes look more like a checklist from this session:

-vSphere 6 re-wrote storage stack (I think for filter integration not necessarily perf)
-check vCenter server JVM sizing
-rightsize VMs
-size VM into pNUMA if possible
-don’t use vCPU hot-add (memory hot add is fine)
-hyperthreading is good
-trust guest & app memory suggestions more than esx counters
-use multiple scsi adapters
-pvscsi more efficient for sharing
-use Receive Side Scaling in the Guest
-use large memory pages
-look at performance KPIs to determine if settings are beneficial (not just cpu%/ etc..)
-ballooning is an early warning flag for paging
-go for a high number of sockets (wide) when in doubt over vsockets or vcores
-watch out for swapping during monthly windows patches
-co-stop is a sign the VM is hurting from having too many CPUs
-clock frequency is latency driver, especially for single threaded ops

Another session I attended was Deployment Best Practices for Consolidating SQL Server & iCDM. There is a webinar the last Wednesday in June for scripting examples and demos.

There were some really good storage performance tips in this session:

-vSCSI adapter – disk queue depth and adapter queue depth
-vmkernel admittance (Disk.SchedNumReqOutstanding)
-physical HBA – per-path queue depth
-zoning and multipathing
-shared datastore = shared queue -> Disk.SchedNumReqOutstanding overrides LUN queue depth
-separate tempdb for snapshot size purposes, tempdb is a lot of noise and change that isn’t needed
-don’t need to split data & logs anymore
-still create multiple data files

Conclusion

The EMC brands are evolving at a pace necessary to keep up with the next wave of enterprise requirements. I was happy to be a part of the conference and hope for a smooth acquisition.


 

Posted on May 20, 2016 in Storage

 

vCenter Server Upgrade Finished

In my previous post I bypassed a fairly lengthy step we had in validating the compatibility matrix. VMware software has strict guidelines in what hardware and firmware levels will be supported. Layer in the 3rd party plug-ins and databases and other VMware software and you have something that resembles a pile of cubes rather than a single matrix. For the basics like esxi & vCenter, this page is helpful: https://www.vmware.com/resources/compatibility/sim/interop_matrix.php#interop&1=577&2=

After the previous list of gotchas, I ran into a few more. My main vCenter was asking for 24GB of free space to upgrade from 5.5 to 6. Easy enough problem, unless you have only 2 drives in a physical blade that are already fully partitioned. Some of the solutions we batted around were:

1. Get the hbas working and zone a lun, then swap the drive letters
2. backup and restore to a VM
3. install a clean 5.5 and point to an existing database
4. Use the VMware converter to P2V

We tried option 2 and failed. There ended up being some limitation of the software that we ran into. Option 4 worked out quite well. At first I was told it wasn't possible because you need vCenter online for the converter to work. Turns out there is a workaround. The P2V only took an hour and I was able to re-size the partitions in the process. Two posts that were very helpful in this process were:

https://blog.shiplett.org/performing-a-p2v-on-a-physical-vcenter-server/

https://virtualnotions.wordpress.com/2015/04/23/how-to-p2v-of-vcenter-server/

5.5 came back online in virtual form on this isolated host fairly quickly. Then it was time for the upgrade.

After about 30 minutes of solid progress bar moving, it appeared to stall out. CPU was idle and the upgrade window showed a message like, “Starting Up vCenter Server…”


I got concerned and almost scrapped it to start over from my VM snapshot. I checked a bunch of log files and looked at the disk activity to see what files it was writing to. None of this really amounted to much of a lead. I looked at the Windows services and vCenter was in the "started" state. I tried to log in with the thick client, but it said I didn't have access. It was at that point the upgrade appeared to take off again and completed without error. I guess it just needed a kick.


The update manager install was simple and uses the same install media. After that I only had one issue remaining. The overview performance charts were not showing up. This is by design in the thick client https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2109800

However, the error I was getting in the Web Client was not by design. Adjusting some network parameters corrected the error I was receiving, after a restart of the Performance Charts service:
https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2131040

Overall, this project was a great experience. I have a better understanding of vCenter and learned which logs are important. I got some practice in disaster recovery (failed upgrades). I am also more comfortable with running virtualized vCenters. The plan is to move to the vCenter appliance at some point, but I suspect that will come after we upgrade our hosts.

 

Posted on March 13, 2016 in Virtual

 

vCenter Server Upgrade Gotchas

We have a sandbox virtual center with a couple hosts in it to test things just like the 5.5 to 6 upgrade we were ready to start. This vCenter has been rebuilt from scratch a few times and doesn’t really have any complexity to it. So when we went to upgrade it, of course it went fine.

The next vCenter I tried to upgrade was at our DR location. Fortunately, I took a backup of the OS and database, because this time I ran into several issues.

administrator@vsphere.local password

This password had expired, and a valid password is required for setup to continue. There is a database hack available that I have used to extend this timeout value inside the RSA database, but it wasn't working this time. The utility vdcadmintool.exe is documented here and quite easy to use to get a new password. It is just a cmd-line utility that will spit out a random password. A great reason to lock down who has Windows Administrator on your vCenter server.

VIEW SERVER STATE permission on the database

There are only a few options to select and the install starts.

In previous versions, we have allowed the service account defined in the ODBC connection to have db_owner. This grants every permission inside the database but nothing at the server level. It turns out v6 requires a server-level permission called VIEW SERVER STATE.

Here is another KB.
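The grant itself is a one-liner; the login name here is a placeholder for whatever account your ODBC connection uses:

USE [master];
GRANT VIEW SERVER STATE TO [DOMAIN\svc_vcenter];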

Rollback Gotchas

After this error was hit, a rollback process started. Rollback doesn't put 5.5 back in place at the filesystem level, so you need more than just a database backup. Part of our operating system restore procedure requires an ISO to be mounted, but since vCenter was down, I couldn't mount that ISO. I had to look in the database to find what host the vCenter VM was running on and connect directly to it with the thick client. There is a VPX_HOSTS view that makes it fairly simple to find which host to connect to.

The restore process also requires us to add a NIC but distributed switches were not available to select in the dropdown. I had to create a standard switch on this host and assign that to the VM so vlan tagging could happen.

After the OS restore and database restore I was able to connect to vcenter 5.5.

Inventory service broken

The next time I tried the install I wasn’t able to start it. There was an error complaining about the inventory service. I checked this by trying to search for a VM in the thick client and, sure enough, it was broken. I’m guessing this was due to the restore but a restart didn’t seem to fix it. I went searching and found another KB to reset the inventory service. There is a handy powershell script at the bottom to speed this lengthy process along.

Logon failure: unknown user name or bad password.

After the restore, I did have to reset the administrator account again. I got a strange password that started with a space, but it worked in the web console so I tried again. The next go at the install died with this message:

Error 1326 while creating SSO group “ComponentManager.Administrators”:dir-cli failed. Error 1326: Logon failure: unknown user name or bad password.

The error had a very strange " ," with extra line breaks around it. There seemed to be a parsing error. This error left my vCenter a steaming pile, so I applied the backups and tried again with a new administrator password. I wasn't able to confirm, but I am pretty sure I got really unlucky and that space at the beginning of the password caused the install to fail. No KBs for that one.

Success… Almost!

This upgrade can take around 30 minutes so I was very delighted to finally see a successful message at the end. I was able to log into the thick client and see VMs. However, my web client was giving me a permission problem even when I was logged in as the administrator@vsphere.local account.

You do not have permissions to view this object or this object does not exist

I ended up calling support on this one and they pointed me to a registry hack. I'm not sure how this happens, but an important service account registry key for a path can get overwritten.

Good Luck!

Hopefully this list helps save someone some grief. vCenter is a complex product with a lot of interconnected services. I’m not terribly unhappy with my upgrade experience. I probably would have had a better time if I had read through all of the best practices. Even though it doesn’t run on MS SQL, I’ll be seriously considering migrating to the appliance version of vCenter after we get completely upgraded.

 

Posted on February 28, 2016 in Virtual

 

Learning about an old join syntax, the hard way

Today I discovered that there is an interesting way to join tables

select c.name, o.order_amount 
from customers c, orders o
where c.id *= o.customer_id

But before you get excited, this syntax has been long gone for years. *= is no more.
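For anyone who hasn't seen it, *= was the old non-ANSI left outer join operator; the same query in the syntax that survives today looks roughly like this:

select c.name, o.order_amount
from customers c
left outer join orders o
    on c.id = o.customer_id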

We have had a critical system stuck in a never-ending migration from one version of the application to another. A few sets of users remained on the old version for a couple years. I kept putting off upgrading the SQL Server in hopes that it would one day go away, but the time came that I couldn't wait any longer. In a farm of 200+ SQL Servers, this was my last SQL 2005 server.

During the migration planning I took a trace and ran it through Upgrade Advisor. It was about a 15-minute trace during peak business hours and captured a decent amount of activity. Everything checked out with no critical findings. We did discover that the backup client would need to be upgraded, but that would be done anyway as part of the parallel server upgrade.

I got a little greedy, skipped Windows 2008, and spun up a Windows 2012/SQL 2012 server and restored a copy of the database to that server. Some testing was done but not quite enough.

When the big day came to migrate the database to a new server, I followed these steps
1. restore the most recent full backup and leave it in recovery (done a day before)
2. set single user mode and disable all the user accounts on the old server
3. take a differential backup on the old server
4. restore the differential and recover the database on the new server
5. power down the old server and change its host A record to a cname to the new server (I’d rather change connection strings, but this thing is really old and we couldn’t be sure we found them all)
6. Map logins and fix any orphans
7. Take a full backup and start a new log backup chain

Everything went better than expected. I actually planned for 3 hours of downtime in case I had to do a full backup and restore for some reason. It ended up only taking 20 minutes, and I could see the clients automatically reconnecting, so things looked good. We had planned a thorough checkout after the 3 hours, so there was about 2 hours and 40 minutes of users connecting before we were really sure everything was OK.

As it turns out, my Upgrade Advisor trace wasn't enough data. This old syntax was used in some startup procedure, so it didn't show up in the trace. I'm not sure if Upgrade Advisor would have caught this, but the deprecated code is clearly documented.

https://technet.microsoft.com/en-us/library/ms143729(v=sql.105).aspx

Another interesting thing I learned was SQL2012 does not have the SQL 2000 compatibility level for databases available. That means during my restore, the database was upgraded to 2005 and this old syntax broke. However, SQL 2008 R2 does have the SQL 2000 level available.
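If you ever need to set the level explicitly, the statement is simple; the database name here is a placeholder:

-- SQL Server 2008 R2 still accepts level 80 (SQL 2000); SQL Server 2012's minimum is 90 (SQL 2005)
ALTER DATABASE [LegacyApp] SET COMPATIBILITY_LEVEL = 80;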

I was considering aborting, changing dns back, powering down the new server and powering up the old server. That would have been relatively quick and I could have had my day back. But, that would sign myself and others up for another weekend maintenance window.

I decided to double check this was the problem and I powered up the old server and verified that the query worked on the old database server and not on the new database server. I then checked another 2008 server and tried the syntax out on another old database I found, sure enough it worked so I had another option.

I took the next couple hours to build a new VM and install SQL 2008. I followed the same steps as before, and fortunately this time things worked out much better. Just a 6-hour upgrade that should have taken 20 minutes. If you have any SQL 2005 servers still around, waiting for that April 12th end-of-support deadline, make sure to watch out for “*=”.

 

Posted on January 12, 2016 in SQL Admin

 

Book Review: The Phoenix Project

I’ve broken this post up into two parts, the first directed at convincing you to buy this book and read it several times, and the second to open up discussion for those who have read the book. There will be spoilers in the second part.

PART 1: No Spoilers

Link to Buy on Amazon

I borrowed this book from a co-worker on Friday and finished it Saturday. Yup, done in one day. 382 pages of stories that seem like they could have come straight from my work related nightmares.

The main character Bill takes over after his boss and his boss’s boss both leave the company. The company is not an IT company and the growing complexity of IT has caused great stress and financial loss.

It is an obvious plug for DevOps. By the end of reading you might wonder if there is any other way to get things done. Keep a skeptical view and enjoy this book.

PART 2: SPOILERS

After the first 10 chapters, I didn’t know how much more I could take. I was physically stressed after reading about the constant firefighting, poor communication, late nights, political sabotage, yelling, swearing, night/weekends/all-nighters, and unreasonable demands. The book depicted a sad state of affairs. I recognized some of the outages, and even the blame game comments sounded spot on.

It's like they consolidated the most frustrating parts of my 9 years at my current company into 3 months. I'm a SAN administrator, and that first outage of payroll that got blamed on the SAN but ended up being a poorly implemented security feature caused my first wave of stress. It was like watching a horror movie. "Corruption" is the catch-all for unknown software errors. If you take action based on wild assumptions, bad things are going to happen. And let me tell you, they continue to happen even though the new boss Bill seems to have a calm, logical approach to things.

I wonder if this book was written like Dilbert, where the author was simply writing about what really happened to him. It's the only way this could be so close to accurate.

About halfway through the book, I had a guess that 3 of the secondary characters who were helping Bill, especially Erik, might have just been his alternate personalities. Wes is the aggressive, obnoxious one; Patty is the over-documenter and process type; and Erik is the philosophical one. I was actually disappointed that they remained real characters and not imaginary. I think it would have added to the story to find out that Bill had really just been going crazy from all the stress.

Change Control

I loved watching the team be shocked at how many changes actually happen in the ops world they have been living in. How could they not know? Changes are like queries on a database: sometimes it makes sense to count them, but mostly they are so different that they can't simply be counted. One single big change can be more impactful and riskier than 1000 small changes combined.

Who changed what, when? Questions all ops teams should be able to answer. The book describes “changes” as one of four types of work. I’m not really certain how it fits into DevOps. Maybe change control is about reducing unplanned work, which is another type of work.

I liked the compromise they made: skipping the crappy change control system but still forcing and encouraging teams to communicate by writing changes on cards. It started a habit and the process communicated the vision. It was an early win in their struggles. The system had so many side benefits, such as discovering the Brent bottleneck.

I wouldn’t encourage IT departments to use an index card method to schedule changes. Its not searchable and doesn’t scale well. A heavy handed software application with too many required fields is not the best approach either. The key is having clear definitions of what “Change” really means and what systems need to be tracked the most. IE: important financial systems such as payroll.

%Busy/%Idle

This concept hit close to home. My team has lost two people in the last few months and the workload is climbing to unprecedented levels. The automation I’ve put in place is in need of upgrades and important business projects are coming to fruition.

When you are busy, you make mistakes. When you make mistakes, it's time-consuming to recover. You also take shortcuts that tend to create more work in the long run. Being busy sucks the life out of people.

Decreasing the wait time for IT to add value to the business is what DevOps is all about. The book illustrates this quite well across several fronts. The way Bill achieves some of his goals before achieving kumbaya in the datacenter is with endless hours. He gets denied more people, so he takes his salaried workforce and makes something out of nothing.

The graph in the book describes why wait times explode: people can function quite well until they are over about 90% busy, but from there wait times go through the roof. You can't squeeze 11% of output out of 10% of idle time; it creates context-switching penalties and queuing, and that is what drives the wait times up.

This is why I sometimes work long hours. I know that if I fall behind, it piles up like laundry and I have no clean underwear. It didn't quite click until I saw the graph in this book, but it makes total sense. Trying to squeeze that last little bit of production out of a person or process can lead to devastating results.

In the book, Bill realizes he needs to dedicate Brent to Project Phoenix. I like the pool of people dedicated to dealing with the escalations that usually go to Brent. It's like training without the training. Allowing Brent to focus leads to some interesting automation discoveries later in the book.

Everything is Awesome!

After the first 10 chapters, the book slows its pace quite a bit. Some characters do a 180 and everything starts going better. It was a little harder to read, and the politics started to take over.

The authors started to apply DevOps approaches to a small team and everything just magically worked. I was hoping there would be continuing issues before they actually got things right, but magic pixie dust just made things work. Brent's server builds just converted over to the cloud without mention of problems, or of the massive cost increases on top of what they had already sunk into onsite servers, not to mention the architectural shift that would have had to take place to successfully run the old code in the cloud. But I suppose they were close to 10 deployments a day, so it would have been fast, right?

 

Posted on November 12, 2015 in Virtual