
Monthly Archives: November 2011

back to t-sql basics (1 of 2)

The first 236 pages don't get into IUDs, that is, Inserts, Updates, and Deletes. There is no data modification in the first 7 chapters, which is how DBAs should start out. I am about halfway through the book, so I decided to pause and type up some of my notes and thoughts.

T-SQL Fundamentals is one of the first technical books I have decided to actually read cover to cover. All of the other technical books I own I have used as reference manuals, maybe focusing on a chapter that related to something that had me stumped.

Taking a pen to paper really helps me stay focused when reading. I use blank sheets of paper without lines so if I feel the urge to draw a picture or change a font size I can do so without limitation. The author stresses several times to do the exercises to really understand what is being covered. I agree, and that is part of my Learn One, Do One, Teach One path to mastery.

The first notable item I came across in the book is a definition of the Resource database. This is a hidden, behind-the-scenes database that stores system objects. Short explanations of all the system databases are as follows:

(1, p15)
1. master – stores meta information and configs
2. model – new db template
3. tempdb – work tables, sort space, row versions (rebuilt on sql restart)
4. msdb – sql agent data
5. Resource – system objects

Unique constraints – the first thing that comes to mind is a primary key, but these can also cover columns like SSN or phone number that are not the primary key.
(1, p21)
A foreign key is a link to another table.

Alter table Orders
add constraint FK_Blah
foreign key (empid)
references dbo.employees(empid)

(note: a constraint name is an identifier, not a string literal, so no quotes around FK_Blah)

The order in which the clauses of a query are logically processed is important when trying to understand the results.
(1, p26)
1. FROM clause – code should always schema-qualify tables in the from clause, "dbo.table" instead of "table"
2. WHERE clause – the filter
3. GROUP BY clause – aggregate functions like SUM() and AVG() are required for any selected column that is not in the GROUP BY list, so only one value is returned per group
4. HAVING clause – another filter, applied to the groups
5. SELECT – the columns or expressions to return
6. ORDER BY clause – it is stressed several times that row order is not guaranteed unless this is added
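As a rough sketch of that logical order (table and column names here are hypothetical, in the spirit of the book's sample database):

-- written order:  SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY
-- logical order:  FROM -> WHERE -> GROUP BY -> HAVING -> SELECT -> ORDER BY
select custid, count(*) as numorders   -- 5. SELECT
from dbo.orders                        -- 1. FROM (schema qualified)
where orderdate >= '20110101'          -- 2. WHERE
group by custid                        -- 3. GROUP BY
having count(*) > 5                    -- 4. HAVING
order by numorders desc;               -- 6. ORDER BY

Because SELECT runs before ORDER BY, the alias numorders can be referenced in the ORDER BY but not in the WHERE or HAVING clauses.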

Aggregate functions ignore NULLs, except COUNT(*), which counts rows regardless.
(1, p42)
TOP – an option to limit results. It is processed as part of the SELECT phase, around the same time as DISTINCT.

(1, p45)
OVER clause – exposes a window of rows to an aggregate function

select orderid, custid, val,
  100. * val / sum(val) OVER() as pct_all,
  100. * val / sum(val) OVER(PARTITION BY custid) as pct_cust
from sales.ordervalues;

note: the literal 100. causes an implicit conversion of the expression to decimal, whereas 100 would keep the arithmetic in int (truncating the division)

This sample OVER query returns two percentages for each orderid, so you can tell how big the order was compared to that customer's orders (the partitioned one) and compared to all orders.

OVER supports 4 ranking functions (example outputs for the same 4 rows):

RANK – 5,6,6,8 (ties share a rank, leaving gaps)
DENSE_RANK – 5,6,6,7 (ties share a rank, no gaps)
ROW_NUMBER – 5,6,7,8 (unique sequential numbers, ties broken arbitrarily)
NTILE(n) – distributes the rows into n roughly equal groups; NTILE(100) approximates percentiles, so the top percentile gets 1
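Here is a sketch putting all four side by side (assuming the book's sales.ordervalues sample view):

select orderid, val,
  row_number() over(order by val) as rownum,
  rank()       over(order by val) as rnk,
  dense_rank() over(order by val) as drnk,
  ntile(4)     over(order by val) as quartile
from sales.ordervalues;

Rows with equal val share the same rnk and drnk but always get distinct rownum values.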

Predicates
IN
BETWEEN
LIKE

(1, p53)
Operators
()
*(multiply) / (divide) % (modulo)
+ –
= > < !=
NOT
AND
OR

(1, p54)
CASE expressions (in T-SQL, CASE is an expression, not a statement)

select prod, prodname, cat_id,
CASE cat_id
  WHEN 1 then 'Bevs'
  WHEN 2 then 'Produce'
END as categoryname
from products

(1, p58)
An important thing to note about nulls:

select * from blah where col <> 'MI'

Although NULL does not equal 'MI', this query will not return rows where col is NULL, because any comparison with NULL evaluates to UNKNOWN and gets filtered out.

col = NULL  <-- returns no rows
col IS NULL <-- returns proper results

NULLs sort before values such as 'MI' or 'AL' when you ORDER BY col.

Also, sql doesn't process left to right in where clauses. EX:

Where col1 0 AND col2/col1 > 2
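Since evaluation order is not guaranteed, one defensive rewrite (my own sketch, not from the book) pushes the guard into the expression itself with NULLIF:

-- NULLIF(col1, 0) returns NULL when col1 = 0, so the division
-- yields NULL and the predicate safely filters the row out
where col2 / NULLIF(col1, 0) > 2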

(1, p69)
Character Data (just the highlights)
STUFF -> remove substring and insert new string
UPPER -> convert to uppercase
SELECT RIGHT(‘12345’, 3) -> 345
wildcards -> %, _, [^A-E]
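For example, STUFF deletes a run of characters and inserts a replacement in one call:

-- delete 3 characters starting at position 2, then insert 'XY'
select STUFF('abcdef', 2, 3, 'XY')  -- 'aXYef'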

(1, p79)
Dates
DATETIME
SMALLDATETIME
DATE
TIME
DATETIME2

SQL Server recognizes string literals such as '20070212' and will convert them. It's suggested to always use the language-neutral format YYYYMMDD so language and DATEFORMAT settings don't cause misinterpretation.

This query is bad for indexes:

select blah from orders
WHERE YEAR(orderdate) = 2007

Change the WHERE clause to a range on orderdate so the function doesn't have to run for every row and an index on orderdate can be used.
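The index-friendly (sargable) version of that filter looks something like this:

select blah from orders
where orderdate >= '20070101'
  and orderdate <  '20080101'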

CAST(GETDATE() as TIME)
select DATEADD(YEAR, 1, ‘20091010’) -> 2010-10-10 00:00:00.000
DATEDIFF(day, ‘20080101’, ‘20090101’) -> 365
DATEPART(month, ‘20080210’) –> int of 2
ISDATE(‘20090101’) 1=TRUE 0=FALSE

(1, p101)
JOINS
The inner workings of these escaped me for quite a while. As soon as I thought I knew what was going on I would be challenged to write a multi-table query and struggle.

The different types of joins are cross, inner (the default), and outer.
APPLY, PIVOT, and UNPIVOT are related table operators.

A cross join creates a Cartesian product where you simply get a result = (tableA # of rows) * (tableB # of rows)

SELECT c.id, e.id
from sales.customers as C
cross join hr.employees as E

results:

id|id
_____
1 |1
1 |2
2 |1
2 |2

The book recommends using ANSI-92 format instead of ANSI-89.

ANSI-89 – DO NOT USE

select a.blah, b.blah
from table1 as a, table2 as b
where a.blah = b.blah

note: without the where clause it would be a cross join

ANSI-92 – GOOD

select c.custid, count(*) as numorders
from sales.customers as c
left outer join sales.orders as o
on c.custid = o.custid
group by c.custid;

This returns one row per customer: the left outer join keeps customers even if they have no orders, matching rows are grouped by customer id, and the second column counts the rows in each group. One caveat: a customer with no orders still produces one all-NULL order row, so COUNT(*) reports 1 for them, while COUNT(o.orderid) would report 0.
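A sketch of the variant that reports 0 for customers without orders (counting a column from the non-preserved side skips the NULLs):

select c.custid, count(o.orderid) as numorders
from sales.customers as c
left outer join sales.orders as o
on c.custid = o.custid
group by c.custid;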

Sub Queries (1, p133)
These can solve a lot of the same problems as joins. There are two flavors: self-contained and correlated. Self-contained subqueries are much easier, especially when debugging, because the inner query can be run separately from the outer query.

Most self contained inner queries return a scalar, or a single value.

select orderid, orderdate
from sales.orders
where orderid = (select max(o.orderid) from sales.orders as o)

Some self-contained inner queries return multiple values, using the IN operator instead of =. For example, orders handled by employees whose last name starts with D:

select orderid, orderdate
from sales.orders
where empid IN (select e.empid
                from hr.employees as e
                where e.lastname like 'D%')

Another example is a subquery with EXISTS in the where clause. Most of the time we should avoid function-type processing in where clauses, but EXISTS is performant because it can stop at the first matching row.

select custid, companyname
from sales.customers as c
where exists (select * from sales.orders as o
              where o.custid = c.custid)

Table expressions and derived tables

select orderyear, count(distinct custid) as numcusts
from (select year(orderdate) as orderyear, custid
      from sales.orders) as D
group by orderyear

(1, p167)
Common table expressions (CTEs) improve on derived tables: they don't have to be nested, can be referenced more than once, and have recursive capabilities.

WITH USACusts as
(
select custid, companyname
from sales.customers
where country='usa'
)
select * from USACusts;

(A CTE definition must be followed immediately by the query that uses it.)
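Since recursion is the headline feature, here is a sketch of a recursive CTE walking an org chart (assuming the sample hr.employees table with empid and mgrid columns):

WITH EmpsCTE as
(
  -- anchor member: start at one manager
  select empid, mgrid, firstname, lastname
  from hr.employees
  where empid = 2
  UNION ALL
  -- recursive member: direct reports of the previous level
  select e.empid, e.mgrid, e.firstname, e.lastname
  from EmpsCTE as m
  join hr.employees as e
    on e.mgrid = m.empid
)
select * from EmpsCTE;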

Set operations

Operations performed on the results of two queries. A query with an ORDER BY clause returns a cursor rather than a relational set, so the input queries cannot use ORDER BY (only the final combined result can be ordered).

(1, p197)
UNION (DISTINCT is implicit)
– UNION ALL keeps duplicates

INTERSECT – return rows that appear in both query A and query B

EXCEPT – return rows from query A that do not appear in query B
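For instance, a sketch using the sample database's tables:

-- countries that have customers but no employees
select country from sales.customers
EXCEPT
select country from hr.employees;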

Source
1. Ben-Gan, I. (2009). Microsoft SQL Server 2008 T-SQL Fundamentals. Redmond, Washington: Microsoft Press.

 

Posted by on November 28, 2011 in SQL Dev

 

back to t-sql basics (intro)

At the SEMSSUG meeting a couple months ago I was one of the early prize winners and selected the book Microsoft SQL Server 2008 T-SQL Fundamentals by Itzik Ben-Gan.
T-SQL Fundamentals
It is on the top 10 book lists of many DBAs. I've been a DBA for a Fortune 500 company for over 2 years now but do not get much practice in T-SQL. A large percentage of the almost 100 SQL Servers I maintain run ISV applications. This doesn't remove the need for me to learn database design and T-SQL; it just removes a lot of the practice and experience I would get if we instead hosted more in-house applications. I tend to stay on the server side of SQL. I don't get my hands dirty writing a lot of queries, designing table layouts, or even adding or removing indexes. What I do is roll out new servers and balance a semi-consolidated virtual environment.

When things start to go poorly I examine the hardware, virtual, OS, and SQL Server layers to see if there are any tweaks or upgrades that would be justified. In the past I have stuck to the big four resources, CPU/MEM/DISK/NETWORK, to identify which one is most heavily used and causing a bottleneck. I like to couple this information with the top SQL and SQL waits during the timeframe when things started to go poorly. Often this information gets packed up and shipped off to the software vendor, where it enters some kind of black hole.

Learning T-SQL and database design will help me get to the root of the 5th resource bottleneck: blocking. Blocking, along with stupidly bad RBAR, has been the root cause of a lot of the performance problems users actually complain about. "My report took 2 seconds yesterday but 10 minutes today, what gives?" Users tend to complain when things are 1.5x or more slower than expected. On the flip side, they will only give a compliment if the software is no less than 100x faster than it used to be.

Sometimes it is a physical resource problem. It's hard to estimate how much hardware a 3rd-party app is going to need because of that 5th resource bottleneck. Sometimes we size accurately up front, but then the user base grows and queries can't get in and out fast enough. Blocking would be evident in most of these cases, but a simple hardware upgrade could reduce blocking by getting queries in and out faster.

The reason I am putting a lot of hours into T-SQL and database design recently is that black hole where the "code vs. servers" support tickets tend to go. Of course I want to be a more well-rounded DBA, but the new job requirement has gotten me motivated. I would like to recover some of the tickets that enter the black hole by giving valid solutions for more of the top (or bad) SQL statements that lock servers and chew up shared resources. I would not need to defend my big 4 resource choices if I could identify and offer solutions for the poorly performing queries.

I have been thoroughly reading, re-reading, taking notes, and doing the exercises in Mr. Ben-Gan's book. I have made it through the first 8 chapters and will post some of my notes in the near future. Most of the examples are based around a small database that can be downloaded, with all scripts, here: http://tsql.solidq.com/resources.htm

 

Posted by on November 18, 2011 in SQL Dev

 

Visualizing multi-threading with Fiddler

Fiddler is an awesome web debugging proxy.  It allows you to see and drill into each HTTP request as they happen. It shows the URL, headers, content, results and much more.

I started using Fiddler when I was bound by a very strict internet policy. It has the ability to auto-respond to HTTP requests from my computer. It acts as a proxy, sitting in between your browser and network card. These auto-responders proved very useful when I was trying to load a page that had several blocked items on it. I could first use Fiddler to identify which requests were returning the generic blocked-page message, then set up an auto-responder to have Fiddler return a simple blank.gif instead of my computer actually making the HTTP request. This prevented my computer from showing up in a list in some network admin's software. The website also loaded a bunch of unnecessary images, and I was only concerned with the text, so I set up *.jpg and *.gif auto-responders to speed up the load time as well.

I hadn't found a good use for Fiddler in a while, until I was working on a project that makes several HTTP requests. I started off single-threaded and got the application working. Then I added multi-threading and it appeared to go a little faster. I was happy until I needed to add a longer list of HTTP requests to the program; I wasn't quite sure I was getting a performance benefit from the multi-threading. Stepping through the code in the debugger didn't really show me exactly when the requests were being made, because the break-points would get hit one at a time.

I had started reading up on all the ways VB.NET can do multi-threading to see if there was an easier way. I looked into Parallel.ForEach loops but read that they would only spawn extra threads if there were extra processors available. I am interested in getting that particular parallel loop working so I can visualize it with Fiddler as well. The other threading tool I looked into was the BackgroundWorker.

BackgroundWorkers seem to be designed to help with Windows Forms. This type of threading keeps forms from going into the classic "Not Responding" state when you tie up the thread that is processing the form's appearance and controls.

Thread pooling is the route I want to take in the future. It gives you the ability to limit the number of threads you spawn at a time. This would be perfect for making HTTP requests, because you don't want to overload a server using a form of "free threading". Fire-and-forget is another name for a style of threading where you don't care about the response from the thread; it can be used when the results are logged to a database.

My program was based around one function. It simply takes a URL parameter and returns a string of all the HTML on the page. Each URL in my list would take 1-5 seconds to pull a response depending on the server load. So a single-threaded program would take at least 5 seconds, and up to 25 seconds, just to check 5 URLs.

This was an obvious candidate for multi-threading. If those same 5 HTTP requests were sent in the first second of execution, the program could complete in 1-5 seconds instead of 5-25 seconds. The scalability would also be much better, because the execution time wouldn't go up much if I added 100 URLs.

So I wrote the program. The keys to multi-threading success were:

1. set up a class wrapper around my function
2. pass the URL in with the Thread.Start overload
3. create an event to handle the function's response
4. use SyncLock so the threads don't run into each other when writing their results
5. Join the thread list so you don't continue on in your code before the multi-threaded part is complete

Imports System.Net
Imports System.IO
Imports System.Security.Cryptography
Imports System.Text
Imports System.Threading

Module Module1
    Dim WithEvents oReadSiteHTML As New ReadSiteHTMLClass
    Sub Main()
        Dim URLlist() As String = System.IO.File.ReadAllLines("sitelist.txt")

        'Multi-Threaded
        Dim ThreadList As New ArrayList

        Dim stopper As New Stopwatch
        stopper.Start()

        For Each tempurl In URLlist
            Dim t As New Thread(AddressOf oReadSiteHTML.GetHTML)
            ThreadList.Add(t)
            t.Start(tempurl)
        Next

        For Each t In ThreadList
            t.join()
        Next

        stopper.Stop()
        Dim timetook As String = stopper.ElapsedMilliseconds.ToString()

        'single threaded

        Dim stopper2 As New Stopwatch
        stopper2.Start()

        For Each tempurl In URLlist
            oReadSiteHTML.GetHTML(tempurl)
        Next

        stopper2.Stop()
        Dim timetook2 As String = stopper2.ElapsedMilliseconds.ToString()

    End Sub

    Private Function GenerateHash(ByVal SourceText As String) As String
        Dim Uni As New UnicodeEncoding()
        Dim ByteSourceText() As Byte = Uni.GetBytes(SourceText)
        Dim Md5 As New MD5CryptoServiceProvider()
        Dim ByteHash() As Byte = Md5.ComputeHash(ByteSourceText)
        Return Convert.ToBase64String(ByteHash)
    End Function

    Public Class ReadSiteHTMLClass
        Public SiteHtmlHash As String
        Public Event ThreadHash(ByVal SiteHtmlHash2 As String)
        Public Sub GetHTML(ByVal URL As String)
            ' CType needed because WebRequest.Create returns the base WebRequest type
            Dim request As HttpWebRequest = CType(WebRequest.Create(URL), HttpWebRequest)
            Dim response As HttpWebResponse = CType(request.GetResponse(), HttpWebResponse)
            Dim reader As StreamReader = New StreamReader(response.GetResponseStream())
            Dim str As String = reader.ReadLine()
            Dim sitefulltext As String = ""
            ' loop condition already stops at end-of-stream; no inner check needed
            Do While (Not str Is Nothing)
                sitefulltext = String.Concat(sitefulltext, str)
                str = reader.ReadLine()
            Loop
            reader.Close()
            SiteHtmlHash = GenerateHash(sitefulltext)

            RaiseEvent ThreadHash(SiteHtmlHash & ":" & URL)
        End Sub
    End Class

    Sub ThreadHash(ByVal SiteHtmlHash3 As String) Handles oReadSiteHTML.ThreadHash
        SyncLock GetType(ReadSiteHTMLClass)
            File.AppendAllText("hashes.txt", SiteHtmlHash3 & System.Environment.NewLine)
        End SyncLock
    End Sub

End Module

Timing the code, I could tell that the multi-threaded part was faster. But I was concerned that the SyncLock was somehow preventing my code from instantiating the class multiple times. I am not sure why I thought this, but it was easy to prove myself wrong by opening up Fiddler. When I ran the code above, I saw the requests all pop in simultaneously and respond at almost the same time; in the single-threaded run, each subsequent HTTP request had to wait for the previous response. It was a real "ah ha" moment where I realized my code was working just the way I wanted it to... well, almost. Once I get into really high numbers of URLs, I want to add the thread pooling. Hopefully then I can use Fiddler and see only 8 or 10 active HTTP requests at a time. Right now the program launches a thread for every URL, which can be called "free threading".

One thing I have explained a couple of times before is the difference between threads and cores. Hyper-threading is like a thread pool of 2, but at the hardware level: a 4-core box with hyper-threading on will show 8 logical cores in Task Manager. A multi-threaded app can perform much better than a single-threaded app even if there is only 1 CPU. Furthermore, you can have as many threads as you want; the OS scheduler time-slices them onto the available cores, also allowing other operations to interrupt your app. Back to the example: the key performance gain is that you don't have to wait for one response before sending the next request.

The screen capture below shows multiple requests downloading simultaneously while several have already finished.

One thing to be cautious of is overloading a server. Not all servers are supercomputers. If the server is doing all the work and all you have to do is send a request, it's easy to overload a server with just one client. The trick is to find the sweet spot in performance and be able to adjust it. Thread pooling gives you a single number that you can tune with each change in the environment.

There are also some automatic security switches that can be used on networks and inside website code to identify malicious behavior. Don’t be the one responsible for getting your company on some blacklist.

 

Posted by on November 2, 2011 in .NET