Tuesday, 27 May 2014

My Perfect World

This is a document on 'approach'; I've composed it for a number of reasons. It covers a lot of things I've done/could do/will do on future projects. Each area could be massively expanded in terms of detail, but it should be enough to inspire ideas in others and keep me on the right track.

Introduction

The industry has encouraged front end developers to treat databases purely as storage. Sadly, the average developer knows little beyond SELECT, INSERT, UPDATE & DELETE.

ORM frameworks are the current trend, marketed as speeding up development, making code portable and encapsulating business rules. This, however, can come at the cost of SQL efficiency and ultimately performance.

On my current project, the following facts are regularly ignored –

  • The database server has far greater processing capability & memory than the client.
  • SQL has a wealth of functionality which (when used correctly) performs fantastically.
  • RDBMS programmability is constantly evolving.


This post, therefore, puts on record the things I would do in a perfect world. It aims to be all-encompassing for those starting out on a new transactional application.

Database design

Normalise as much as is practical. Aim for 3NF (3rd Normal Form), but be flexible if performance would be compromised by this goal. Do not let an ORM tool design the database. The result will have far too many objects and, as a result, will be heavy on the number of joins.

Data Types

Closely match the data you want to store with a data type. For numeric types this means the precision. For text this means looking at the length and the characters you need to store.

Only use Unicode column types (NCHAR, NVARCHAR) when you need to store Unicode. They require twice the storage yet are frequently used ‘just in case’.

Ensure you don't use deprecated data types, i.e. ones that will be removed in future versions of SQL Server. NTEXT, TEXT and IMAGE were listed as deprecated for SQL 2008, but still appear in both SQL 2012 & 2014.

Data Sizing

Numerical data types should allow for growth but not be wasteful of storage. Choose TINYINT, INT, BIGINT etc. appropriately.

Character columns should be sized appropriately for their content. Do not just guess and oversize them due to not researching the data. Choose between CHAR, VARCHAR and their Unicode equivalents appropriately.

By default CHAR(n) will store n characters and the storage requirement will be n bytes. Unless overridden by ANSI_PADDING settings, CHAR columns containing less data will be padded with blanks up to n characters.

VARCHAR(n) will store up to n characters without padding. There is a 2 byte storage overhead for defining a column as variable length, hence the storage required is the actual data length + 2 bytes.

Stay away from VARCHAR(MAX) unless you really need more than 8000 characters (the maximum explicit length VARCHAR can be defined with). VARCHAR(MAX) can hold up to 2147483647 characters. In reality, the sizes and types of the other columns present in your table will determine whether data is stored 'in row'.
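
As an illustrative sketch of these choices (the table and column names are hypothetical, not from any real project), a customer table sized along these lines might look like:

  CREATE TABLE dbo.Customer
  (
      CustomerID   INT IDENTITY(1,1) NOT NULL,  -- allows growth beyond TINYINT/SMALLINT
      CountryCode  CHAR(2)       NOT NULL,      -- fixed-length ISO code, no wasted padding
      Surname      VARCHAR(50)   NOT NULL,      -- variable length, sized from the data
      Notes        VARCHAR(1000) NULL,          -- generous but well under the 8000 limit
      CreatedDate  DATETIME2(0)  NOT NULL
  );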


Nullability

NULL is the absence of data and is different from an empty string. NULL values represent data that we do not have. NULLability often provokes strong feelings from those who have not been involved in database design. Some languages (e.g. ColdFusion) treat it as an empty string, which simplifies matters for front end developers.

  • Numeric NULLs are important as you cannot represent an unknown number using 0 (zero).  Zero is not the same as unknown when dealing with finances or statistics.
  • Character columns could represent missing data as either an empty string ‘’ or as NULL.  This depends on the application that entered the data. Allowing both would mean you must test for both to return empty columns.

There are a lot of theories, personal preferences and corresponding web posts surrounding the use of NULLs. Proving or disproving them is difficult across multiple database platforms, implementations of SQL and table sizes.

Storage

  1. In a fixed size column e.g. INT, storing NULL takes the size of the column. (For an INT, 4 bytes). 
  2. In a fixed length column e.g. CHAR, storing NULL takes the length of the column. (For CHAR(10), 10 bytes).
  3. In a variable length column e.g. VARCHAR, a NULL is stored as 0 bytes (for the data) + 2 bytes (the storage overhead for defining ‘variable’ length).

Code Complexity

Handling NULL values correctly in SQL Server is done via
  1. The WHERE clause e.g. WHERE column IS [NOT] NULL
  2. Inline Functions e.g. COALESCE, ISNULL, NULLIF.

NULL values cannot be compared using equals (=). If an application allows a character column to be either NULL or an empty string (‘’) then you must account for both, as shown in the sketch below.
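
A minimal sketch of both approaches, assuming a hypothetical dbo.Customer table with a MiddleName column that may be NULL or empty:

  -- WHERE clause: NULL cannot be compared with =
  SELECT CustomerID, Surname
  FROM dbo.Customer
  WHERE MiddleName IS NULL;

  -- Inline functions: treat NULL and empty string the same way
  SELECT CustomerID,
         COALESCE(NULLIF(MiddleName, ''), 'Unknown') AS MiddleName
  FROM dbo.Customer;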

Performance

If there are many NULL values in a queried column, SQL Server may decide it is easier to perform a ‘table scan’, i.e. ignore an index that is in place. A table or clustered index scan means the value of the queried column is compared for each row in turn. This could become an issue as a system grows.
From SQL 2008 onwards a filtered index could be utilised to explicitly ignore NULL values.
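
For example (the index and column names here are hypothetical):

  CREATE NONCLUSTERED INDEX IX_Customer_MiddleName
  ON dbo.Customer (MiddleName)
  WHERE MiddleName IS NOT NULL;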

NULL By Design

By allowing NULLs in a (traditionally Boolean) BIT column, the data type can store 3 states.
1, 0, NULL representing TRUE, FALSE and UNKNOWN.
One byte is used to store up to 8 BIT columns; columns 9 to 16 are stored in a second byte, and so on.
Storing multiple BIT columns in the same table occupies a minimal amount of storage.

Some ideas to avoid NULLs
  • Use LEFT JOINs to a second table to indicate the presence of data, rather than hold NULLs in the source table.
  • Vertically partition a table into mandatory/frequently used columns and optional columns. Place the optional columns, where NULLs are frequently present, in a second table with a 1-to-1 mapping on the primary key.
  • Replace NULLs with a fixed value (e.g. -1) to represent unknown. This would be indexable. Use a default value when defining the column to populate it.
  • Expand upon this idea with further keys for Not Applicable, Not Yet Known, Truly Unknown.
  • Mask NULLs from the application by using SQL programmability objects (views, functions & stored procedures) – see the sketch below.
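
As a sketch of that last idea (object names hypothetical), a view can mask NULLs so the application only ever sees a fixed value:

  CREATE VIEW dbo.vCustomer
  AS
  SELECT  CustomerID,
          Surname,
          COALESCE(MiddleName, '')  AS MiddleName,   -- application never sees NULL
          COALESCE(CreditLimit, -1) AS CreditLimit   -- -1 represents 'unknown'
  FROM dbo.Customer;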

Constraints 

There are 4 types of constraint that can be defined on a column in a table.

Primary Key Constraint

A Primary Key is a column (or group of columns) that uniquely identify a row.
A Primary Key cannot be null.  No duplicates are allowed in primary keys.

Foreign Key Constraint

Foreign Keys enforce integrity of relationships and prevent orphaned records. 
They will have an overhead in terms of enforcing the order in which data is loaded.

Check Constraint

A check constraint can ensure data validity and enforce business logic by only allowing values that match a defined range.
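
A brief sketch showing all three of the above together (table and column names hypothetical; it assumes dbo.Customer has a primary key on CustomerID):

  CREATE TABLE dbo.CustomerOrder
  (
      OrderID     INT IDENTITY(1,1) NOT NULL
          CONSTRAINT PK_CustomerOrder PRIMARY KEY,
      CustomerID  INT NOT NULL
          CONSTRAINT FK_CustomerOrder_Customer
          REFERENCES dbo.Customer (CustomerID),
      Quantity    INT NOT NULL
          CONSTRAINT CK_CustomerOrder_Quantity CHECK (Quantity > 0)
  );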

Unique Constraint

A column can be required to have unique contents by a UNIQUE constraint in the definition of the table. It is implemented as a unique index internally by SQL Server. Personally I think it is better to explicitly define a unique index on the column instead (see the example after the list below).

Advantages of a unique index are that it can be -

  • explicitly named (and therefore tested for in rollout scripts)
  • searched (it is an index after all)
  • disabled (if you wanted to allow duplicate values; you cannot disable a unique constraint without altering the table itself)
  •  named in an index hint (if you find it necessary to force index use)
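
For example, an explicitly named unique index (hypothetical names) that can later be disabled or named in a hint:

  CREATE UNIQUE NONCLUSTERED INDEX UQ_Customer_Email
  ON dbo.Customer (EmailAddress);

  -- can be disabled without altering the table definition
  ALTER INDEX UQ_Customer_Email ON dbo.Customer DISABLE;
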
In a transactional database, having these constraints in place is best practice and ensures data integrity. Attempts to violate these rules could be captured by a well written application.

In a data warehouse scenario where speed of loading records is paramount, lots of constraints would not be appropriate. Having them would slow down record insertion, but data would also have already been validated at source.



Style

Choose a style and stick to it!
Have a naming convention for objects.
Avoid punctuation or characters in object names that you cannot quickly type.
Avoid Reserved Words, i.e. don’t call a table ‘table’, a procedure ‘procedure’ (or vice versa) or a column ‘datetime’.

Adopt a coding style you can easily maintain.
Place SQL keywords on separate lines.
Use tabs and spaces to align column and table names.
Sensibly alias joined tables with meaningful names.
Format complicated WHERE clauses with brackets to make them more readable, and split them over multiple lines. Operator precedence is not an easy thing to memorise.

Schemas 

Schemas were introduced in SQL 2005. Whilst they have management and security benefits, their most immediate use is to allow the separation of database objects into logical groups. The AdventureWorks downloads on CodePlex are a good place to see this in action.
I would use them to separate Application data into areas of Configuration, Access Control, Auditing as well as the application data itself.
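
A minimal sketch of that separation (the schema and table names are just my suggestions):

  CREATE SCHEMA Config;
  GO
  CREATE SCHEMA AccessControl;
  GO
  CREATE SCHEMA Audit;
  GO
  CREATE TABLE Config.Setting
  (
      SettingName  VARCHAR(100) NOT NULL PRIMARY KEY,
      SettingValue VARCHAR(500) NOT NULL
  );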


Programming SQL - Use the database

At the time of writing, I would approach a new application with the following techniques.
  • Use Table Valued Functions and Stored Procedures to return data.
  • Use Views (carefully) if SQL code is reused frequently. Do not build views on views.
  • Use Stored Procedures to perform DML operations (INSERT, UPDATE & DELETE) rather than allowing an application direct access to the tables.
    (ORM frameworks can still use stored procedures – you can market this as the best of both worlds).
  • Perform error checking with TRY/CATCH blocks (see the sketch below).
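
As a sketch of the last two points, a DML stored procedure with TRY/CATCH error handling might look like this (names hypothetical):

  CREATE PROCEDURE dbo.usp_Customer_Insert
      @Surname     VARCHAR(50),
      @CountryCode CHAR(2)
  AS
  BEGIN
      SET NOCOUNT ON;
      BEGIN TRY
          INSERT INTO dbo.Customer (Surname, CountryCode, CreatedDate)
          VALUES (@Surname, @CountryCode, SYSDATETIME());
      END TRY
      BEGIN CATCH
          -- log or re-raise; THROW preserves the original error (SQL 2012+)
          THROW;
      END CATCH
  END;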

I would do my utmost to avoid …
Scalar functions
These are called once for every row in the result set
They run on a single thread
(they do not scale)

RBAR Processing
An acronym for ‘Row By Agonising Row’, this describes the iterative nature of repeating the same code again and again. This could be in CURSORs or a WHILE loop; the thinking here is the opposite of a SQL set-based approach.
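
A small illustration of the difference, using the hypothetical dbo.Customer table: the RBAR version touches rows one at a time, while the set-based version does the same work in a single statement.

  -- RBAR: touch each row individually in a WHILE loop
  DECLARE @ID INT, @MaxID INT;
  SELECT @ID = MIN(CustomerID), @MaxID = MAX(CustomerID) FROM dbo.Customer;
  WHILE @ID <= @MaxID
  BEGIN
      UPDATE dbo.Customer SET CountryCode = UPPER(CountryCode) WHERE CustomerID = @ID;
      SET @ID = @ID + 1;
  END;

  -- Set-based: the same result in one statement
  UPDATE dbo.Customer SET CountryCode = UPPER(CountryCode);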

Triggers
I am not a massive fan of triggers as they can quickly make a database difficult to maintain.
They hide business logic and need to be documented upfront (which we all do, right?)
One of the first things to do when joining a development project is to look for the existence of triggers.

They do allow some useful functionality, e.g. an INSTEAD OF trigger can allow a view to become updatable by containing code to update multiple underlying tables correctly.
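
A sketch of that pattern (the view and tables are hypothetical, and the key is assumed to be supplied by the caller rather than an identity): an INSTEAD OF trigger routes an insert against the view to the two underlying tables.

  CREATE TRIGGER trg_vCustomerFull_Insert
  ON dbo.vCustomerFull
  INSTEAD OF INSERT
  AS
  BEGIN
      SET NOCOUNT ON;
      -- write each underlying table from the virtual 'inserted' table
      INSERT INTO dbo.Customer (CustomerID, Surname, CountryCode)
      SELECT CustomerID, Surname, CountryCode FROM inserted;

      INSERT INTO dbo.CustomerDetail (CustomerID, Notes)
      SELECT CustomerID, Notes FROM inserted;
  END;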


Security – Authentication & Authorization
From Day 1, define the necessary security and create:
  1. User accounts the application will require. Use the principle of least privilege to assign permissions to these logins.
  2. A maintenance account for executing jobs that manage the SQL environment e.g. backups, maintenance plans etc.
  3. An administrator account for managing the server and applying updates.
You can choose to tie your application security and SQL security together via Windows Authentication, or develop separate tables for logins & passwords.
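
A minimal sketch of the least-privilege idea (the login, user and schema names are hypothetical): the application account can execute procedures and read configuration, but cannot touch tables directly.

  CREATE LOGIN MyAppLogin WITH PASSWORD = 'UseAStrongPasswordHere!';
  CREATE USER MyAppUser FOR LOGIN MyAppLogin;

  GRANT EXECUTE ON SCHEMA::dbo    TO MyAppUser;
  GRANT SELECT  ON SCHEMA::Config TO MyAppUser;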


Testing

The world of testing has multiple disciplines nowadays. From a database developer's perspective, there are a number of specific areas of interest.

Unit Testing
Unit Testing is a method of functional testing that can be applied to small modules of code (units). SQL Stored procedures and functions lend themselves well to Unit Testing as they perform actions and provide outputs according to supplied parameters i.e. their inputs.

Detailed database testing can also be done to cover all aspects of database programming, data validity and referential integrity. I would personally be doing this as I go along, but the technique is perfectly valid for checking the work of others. http://en.wikipedia.org/wiki/Database_testing

Several ‘non-functional’ testing disciplines are essential too.

Performance Testing

These are conducted on designated hardware and can be subdivided into
  • Configuration Testing – How making configuration changes on known hardware affects database performance.
  • Load Testing – How the system behaves under a defined load
  • Stress Testing – Increasing the load to determine the limits of a system.
  • Soak Testing – Endurance testing to ensure a system can cope with a continuous load.
  • Spike Testing – How the system performs when suddenly under a greatly increased load, i.e. number of concurrent users, number of active transactions.
  • Volume Testing – How the size of a database can influence all the above tests.
Scalability Testing adds to the above list by measuring how well a system can –
  • Scale up (improve hardware by adding RAM, CPUs or disks)
  • Scale out (add extra servers)
Both approaches could enable more users and faster response times if the application/database code also scales.


Applied SQL Testing

Database performance testing does not need to become complicated; it can be started early and repeated as development progresses.

Data Generation

Generating a volume of test data is relatively easy, especially as test data sets and even random data generators are freely available on the web. Random transactional data can be generated ad hoc. If stored procedures have been used by the application to enter data, calling them again from a T-SQL WHILE loop is an easy task and will quickly build a valid data set far faster than the application is able to.
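
A sketch of that idea, calling the hypothetical insert procedure from earlier in a loop with generated values:

  DECLARE @i INT = 1;
  DECLARE @Surname VARCHAR(50);
  WHILE @i <= 100000
  BEGIN
      SET @Surname = 'TestSurname' + CAST(@i AS VARCHAR(10));  -- or pick from a random list
      EXEC dbo.usp_Customer_Insert @Surname = @Surname, @CountryCode = 'GB';
      SET @i = @i + 1;
  END;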

Concurrency

Testing concurrency means reproducing a realistic load. In reality, users will simultaneously be performing different actions: User A will be entering records, User B will be searching, User C will be amending data, etc.

Testing an application can be automated with expensive tools, or could be mimicked in SQL. To do this, we need to know patterns of usage i.e. the order in which stored procedures are called for a given application action. These can be reproduced in test scripts and populated with random data at run time. Timings can also be recorded in a table with little overhead. 

Dynamically generating SQL agent jobs is a technique we can draw on to generate multiple requests. We can randomise the number of simultaneous users, the actions they are performing and the timings between requests. This will not reproduce client or network issues as results would not leave the server. It provides a solid benchmark for the capability of the SQL backend.
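
A rough sketch of the technique, creating and starting a one-off Agent job from T-SQL; the database name and the test procedure it runs (usp_RunRandomUserActions) are hypothetical, and parameters are kept to a minimum:

  DECLARE @JobName SYSNAME = 'LoadTest_User_' + CAST(NEWID() AS CHAR(36));

  EXEC msdb.dbo.sp_add_job       @job_name = @JobName;
  EXEC msdb.dbo.sp_add_jobstep   @job_name = @JobName,
                                 @step_name = 'Run test script',
                                 @subsystem = 'TSQL',
                                 @database_name = 'MyAppDb',
                                 @command = 'EXEC dbo.usp_RunRandomUserActions;';
  EXEC msdb.dbo.sp_add_jobserver @job_name = @JobName;   -- local server
  EXEC msdb.dbo.sp_start_job     @job_name = @JobName;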

Data Changes

Knowing what data has changed, when, how and by whom fulfils auditing requirements. It can also be used to intelligently load a data warehouse with incremental changes. To achieve this I advocate ensuring development efforts are compatible with a form of change tracking right from the start.

Change Tracking (SQL 2008+)

Change Tracking provides a synchronous record of which rows have changed. Once enabled on a table, you need to query the functions that return the changes within the configured retention period (two days by default), else they will be lost.
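
A minimal sketch of enabling and querying it (database and table names hypothetical):

  ALTER DATABASE MyAppDb
  SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON);

  ALTER TABLE dbo.Customer ENABLE CHANGE_TRACKING;

  -- return rows changed since version 0 (store the last version you read)
  SELECT CT.CustomerID, CT.SYS_CHANGE_OPERATION
  FROM CHANGETABLE(CHANGES dbo.Customer, 0) AS CT;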

Change Data Capture (SQL 2008+)

This requires the Enterprise Edition of SQL Server and utilises the transaction log and SQL Server Agent to provide an asynchronous (after transaction) record of changes.

Roll your own!

A variety of methods can tell if a row has changed. Using TRIGGERs is the simplest, but deploying a custom auditing solution could also be achieved using either ROWVERSION columns or column CHECKSUMs.
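
As a sketch of the ROWVERSION approach (hypothetical table): the column is updated automatically on every modification, so comparing against a stored watermark reveals changed rows.

  ALTER TABLE dbo.Customer ADD RowVer ROWVERSION;

  -- rows modified since the last recorded watermark
  DECLARE @LastRowVer BINARY(8) = 0x0000000000000000;
  SELECT CustomerID, Surname
  FROM dbo.Customer
  WHERE RowVer > @LastRowVer;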


Documentation

Document your schema, code, standards and changes from day one.
Tools exist to assist with documentation, but SQL Server’s own ‘Extended Properties’ are a highly flexible way to achieve this too. The end result is a self-documenting database.
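
For example, documenting a table and one of its columns with extended properties (names hypothetical):

  EXEC sys.sp_addextendedproperty
       @name = N'MS_Description',
       @value = N'Core customer record',
       @level0type = N'SCHEMA', @level0name = N'dbo',
       @level1type = N'TABLE',  @level1name = N'Customer';

  EXEC sys.sp_addextendedproperty
       @name = N'MS_Description',
       @value = N'ISO 3166-1 alpha-2 country code',
       @level0type = N'SCHEMA', @level0name = N'dbo',
       @level1type = N'TABLE',  @level1name = N'Customer',
       @level2type = N'COLUMN', @level2name = N'CountryCode';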

Application Design Considerations 

Here are the things I wish applications would do.

Keep application code light  

Push all data manipulation back to the database server.

Perform connectivity handling 

By this I mean handle timeouts by allowing retries, customising retry intervals and the number of retry attempts performed. During this process provide the user with a display e.g. a countdown to the next retry.

Be a true multi user application

Implement checking to see if a record has been changed by another user since it was loaded. It may be a rare event that two users would change the same client record, but I would advocate adding checking to ensure data is not lost in this way.

Centralise Configuration Information

Hold configuration information in SQL tables. Ideally, an application configuration file would be a single XML file containing server, database and user names as well as an (encrypted) password. All other configuration items would be fetched from database tables (in the Config schema!). Login would be permitted by tables in the AccessControl schema, and tables in the Config schema would then be queried again for user-specific settings.
