Introduction
The industry has encouraged front end developers to treat databases purely as storage. Sadly, the average developer knows little beyond SELECT, INSERT, UPDATE & DELETE.
ORM frameworks are the current trend, marketed as speeding up development, making code portable and encapsulating business rules. This, however, can come at the cost of SQL efficiency and ultimately performance.
On my current project, the following facts are regularly ignored –
- The database server has far greater processing capability & memory than the client.
- SQL has a wealth of functionality which (when used correctly) performs fantastically.
- RDBMS programmability is constantly evolving.
This post, therefore, puts on record the things I would do in a perfect world. It aims to be all-encompassing for those starting out on a new transactional application.
Database design
Normalise as much as is practical. Aim for 3NF (3rd Normal Form), but be flexible if performance would be compromised by this goal. Do not let an ORM tool design the database; the result will have far too many objects and, as a consequence, will be heavy on the number of joins.
Data Types
Closely match the data you want to store with a data type. For Numeric types this means the precision. For text this means looking at the length and characters you need to store.
Only use Unicode column types (NCHAR, NVARCHAR) when you need to store Unicode. They require twice the storage yet are frequently used ‘just in case’.
Ensure you don’t use deprecated data types, i.e. ones that will be removed in future versions of SQL Server. NTEXT, TEXT and IMAGE were listed as deprecated for SQL 2008, but still appear in both SQL 2012 & 2014.
Data Sizing
Numerical data types should allow for growth but not be wasteful of storage. Choose TINYINT, INT, BIGINT etc. appropriately.
Character columns should be sized appropriately for their content. Do not just guess and oversize them because you have not researched the data. Choose between CHAR, VARCHAR and their Unicode equivalents appropriately.
By default CHAR(n) will store n characters and the storage requirement will be n bytes. Unless overridden by ANSI_PADDING settings, CHAR columns containing less data will be padded with blanks up to n characters.
VARCHAR(n) will store up to n characters without padding. There is a 2 byte storage overhead for defining a column as variable length, hence storage required is n+2 bytes.
Stay away from VARCHAR(MAX) unless you really need over 8000 characters (the maximum length a VARCHAR can be defined with). VARCHAR(MAX) can hold up to 2,147,483,647 characters. In reality, the sizes and types of the other columns present in your table will determine whether data is stored ‘in row’.
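As a minimal sketch of the sizing advice above (the table and column lengths are invented for illustration):

CREATE TABLE dbo.Customer
(
    CustomerID   INT          NOT NULL,   -- ~2 billion values; only use BIGINT if you expect more
    CountryCode  CHAR(2)      NOT NULL,   -- fixed-length ISO code, nothing wasted on padding
    Surname      VARCHAR(50)  NOT NULL,   -- a researched maximum, not a guess of 255
    Notes        VARCHAR(500) NULL        -- avoid VARCHAR(MAX) unless 8000+ characters are genuinely needed
);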
Nullability
NULL is the absence of data and is different from an empty string. NULL values represent data that we do not have. NULLability often provokes strong feelings from those who have not been involved in database design. Some languages (e.g. ColdFusion) treat it as an empty string, which simplifies matters for front end developers.
- Numeric NULLs are important as you cannot represent an unknown number using 0 (zero). Zero is not the same as unknown when dealing with finances or statistics.
- Character columns could represent missing data as either an empty string ‘’ or as NULL. This depends on the application that entered the data. Allowing both would mean you must test for both to return empty columns.
There are a lot of theories, personal preferences and
corresponding web posts surrounding the use of NULLs. Proving or disproving
them is difficult across multiple database platforms, implementations of SQL
and table sizes.
Storage
- In a fixed size column e.g. INT, storing NULL takes the size of the column. (For an INT, 4 bytes).
- In a fixed length column e.g. CHAR, storing NULL takes the length of the column. (For CHAR(10), 10 bytes).
- In a variable length column e.g. VARCHAR, a NULL is stored as 0 bytes (for the data) + 2 bytes (the storage overhead for defining ‘variable’ length).
Code Complexity
Handling NULL values correctly in SQL Server is done via
- The WHERE clause e.g. WHERE column IS [NOT] NULL
- Inline Functions e.g. COALESCE, ISNULL, NULLIF.
NULL values cannot be compared using equals (=). If an
application allows a character column to be either NULL or an empty string (‘’) then you must account for both.
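A few hedged examples of these points (dbo.Customer and MiddleName are assumed names, not from a real schema):

-- = NULL never matches; use IS [NOT] NULL
SELECT CustomerID
FROM   dbo.Customer
WHERE  MiddleName IS NULL;

-- Substitute a value for NULL with an inline function
SELECT CustomerID, COALESCE(MiddleName, '') AS MiddleName
FROM   dbo.Customer;

-- If the application allows both NULL and '', test for both
SELECT CustomerID
FROM   dbo.Customer
WHERE  MiddleName IS NULL OR MiddleName = '';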
Performance
If there are many NULL values in a queried column, SQL Server may decide it is easier to perform a ‘table scan’, i.e. to ignore any index in place. A table or clustered index scan means the value of the queried column is compared for each row in turn. This could become an issue as a system grows.
From SQL 2008 onwards a filtered index could be utilised to explicitly ignore NULL values.
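A sketch of such a filtered index (index and column names are hypothetical):

CREATE NONCLUSTERED INDEX IX_Customer_MiddleName
ON dbo.Customer (MiddleName)
WHERE MiddleName IS NOT NULL;   -- NULL rows are simply not included in the index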
NULL By Design
By allowing NULLs in a (traditionally Boolean) BIT column,
the data type can store 3 states.
1, 0, NULL representing TRUE, FALSE and UNKNOWN.
1 byte is used to store up to 8 bit columns, 9 to 16 bit columns will be stored in a 2nd byte etc.
Storing multiple BIT columns in the same table occupies a minimal amount of storage.
Some ideas to avoid NULLs
- Use Left Joins to a second table to return the presence of data, rather than hold NULLs in the source.
- Vertically partition a table into mandatory/frequently used columns and optional columns. Place the optional columns, where NULLs are frequently present, in a second table with a 1-to-1 mapping on the primary key.
- Replace NULLs with a fixed value (e.g. -1) to represent unknown; this would be indexable. Use default values when defining columns to populate this value (see the sketch after this list).
- Expand upon this idea with further keys for Not Applicable, Not Yet Known, Truly Unknown.
- Mask NULLs from the application by using SQL programmability objects (views, functions & stored procedures).
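The fixed-value idea could look like this (the table, column and choice of -1 are assumptions for illustration):

CREATE TABLE dbo.Survey
(
    SurveyID INT NOT NULL,
    Score    INT NOT NULL
        CONSTRAINT DF_Survey_Score DEFAULT (-1)   -- -1 represents ‘unknown’ and remains indexable
);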
Constraints
There are 4 types of constraint that can be defined on a
column in a table.
Primary Key Constraint
A Primary Key is a column (or group of columns) that
uniquely identify a row.
A Primary Key cannot be null. No duplicates are allowed in primary keys.
Foreign Key Constraint
Foreign Keys enforce integrity of relationships and prevent
orphaned records.
They will have an overhead in terms of enforcing the order in which data is loaded.
Check Constraint
A check constraint can ensure data validity and enforce
business logic by only allowing values that match a defined range.
Unique Constraint
A column can be required to have unique contents by a UNIQUE constraint in the definition of the table. It is implemented as a UNIQUE index internally by SQL Server. Personally, I think it is better to explicitly define a unique index on the column instead.
Advantages of a unique index are that it can be -
- explicitly named (and therefore tested for in rollout scripts)
- searched (it is an index after all)
- disabled (if you wanted to temporarily allow duplicate values; you cannot disable a unique constraint without altering the table itself)
- named in an index hint (if you find it necessary to force index use)
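Bringing the constraint types together, a sketch with hypothetical names (the foreign key assumes a dbo.Customer table with a CustomerID primary key):

CREATE TABLE dbo.OrderHeader
(
    OrderID     INT         NOT NULL
        CONSTRAINT PK_OrderHeader PRIMARY KEY,
    CustomerID  INT         NOT NULL
        CONSTRAINT FK_OrderHeader_Customer REFERENCES dbo.Customer (CustomerID),
    OrderNumber VARCHAR(20) NOT NULL,
    Quantity    INT         NOT NULL
        CONSTRAINT CK_OrderHeader_Quantity CHECK (Quantity > 0)
);

-- A named unique index rather than a unique constraint, so it can be searched, hinted and disabled
CREATE UNIQUE NONCLUSTERED INDEX UX_OrderHeader_OrderNumber
ON dbo.OrderHeader (OrderNumber);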
In a transactional database, having these constraints in
place is best practice and ensures data integrity. Attempts to violate these
rules could be captured by a well written application.
In a data warehouse scenario where speed of loading records is paramount, lots of constraints would not be appropriate. Having them would slow down record insertion, and the data would in any case have been validated at source.
Style
Choose a style and stick to it!
Have a naming convention for objects.
Avoid punctuation or characters in object names that you cannot quickly type.
Avoid Reserved Words, i.e. don’t call a table ‘table’, a procedure ‘procedure’ (or vice versa) or a column ‘datetime’.
Adopt a coding style you can easily maintain.
Place SQL keywords on separate lines.
Use tabs and spaces to align column and table names.
Sensibly alias joined tables with meaningful names.
Format complicated WHERE clauses with brackets to make them more readable and split them over multiple lines. Operator precedence is not an easy thing to memorise.
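A short example of the layout points above (tables, aliases and columns are invented):

SELECT   ord.OrderID,
         cust.Surname
FROM     dbo.OrderHeader AS ord
         INNER JOIN dbo.Customer AS cust
                 ON cust.CustomerID = ord.CustomerID
WHERE    (ord.OrderDate >= '2015-01-01')
  AND    (cust.CountryCode = 'GB'
          OR cust.CountryCode = 'IE');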
Schemas
Schemas were introduced in SQL 2005. Whilst they have
management and security benefits their most immediate use is to allow the
separation of database objects into logical groups. The AdventureWorks
downloads on codeplex are a good place to see this in action.
I would use them to separate Application data into areas of Configuration, Access Control, Auditing as well as the application data itself.
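As a sketch (the schema and table names here are my own suggestions, not a standard):

CREATE SCHEMA Config AUTHORIZATION dbo;
GO
CREATE SCHEMA AccessControl AUTHORIZATION dbo;
GO
CREATE SCHEMA Audit AUTHORIZATION dbo;
GO
-- Objects then live in their logical group, e.g. a configuration table
CREATE TABLE Config.ApplicationSetting
(
    SettingName  VARCHAR(100) NOT NULL PRIMARY KEY,
    SettingValue VARCHAR(500) NOT NULL
);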
Programming SQL - Use the database …
At the time of writing, I would approach a new application with the following techniques.
- Use Table Valued Functions and Stored Procedures to return data.
- Use Views (carefully) if SQL code is reused frequently. Do not build views on views.
- Use Stored Procedures to perform DML operations
(INSERT, UPDATE & DELETE) rather than allowing an application direct access
to the tables.
(ORM frameworks can still use stored procedures – you can market this as the best of both worlds).
- Perform error checking with TRY/CATCH blocks.
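A minimal sketch of a DML procedure wrapped in TRY/CATCH (procedure, table and parameters are hypothetical; THROW assumes SQL 2012+):

CREATE PROCEDURE dbo.Customer_Insert
    @Surname     VARCHAR(50),
    @CountryCode CHAR(2)
AS
BEGIN
    SET NOCOUNT ON;
    BEGIN TRY
        INSERT INTO dbo.Customer (Surname, CountryCode)
        VALUES (@Surname, @CountryCode);
    END TRY
    BEGIN CATCH
        -- Re-raise the error to the caller; a real system might log it first
        THROW;
    END CATCH
END;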
I would do my utmost to avoid …
Scalar functions
These are called once for every row in the result set. They run on a single thread (they do not scale).
RBAR Processing
An acronym for ‘Row By Agonising Row’, this describes the iterative nature of repeating the same code again and again. This could be in cursors or a WHILE loop; the thinking here is the opposite of a SQL set-based approach.
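As an illustration, the set-based equivalent of looping over rows one at a time (table and condition invented):

-- Instead of a cursor or WHILE loop updating one row per iteration,
-- let the engine process every qualifying row in a single statement.
UPDATE dbo.OrderHeader
SET    Quantity = 0
WHERE  OrderDate < '2010-01-01';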
Triggers
I am not a massive fan of triggers as they can quickly make
a database difficult to maintain.
They hide business logic and need to be documented upfront (which we all do, right?)
One of the first things to do when joining a development project is to look for the existence of triggers.
They do allow some useful functionality, e.g. an INSTEAD OF trigger can allow a view to become updatable by containing the code to update multiple underlying tables correctly.
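A sketch of that INSTEAD OF pattern (the view and both underlying tables are hypothetical):

CREATE TRIGGER dbo.trg_CustomerView_Insert
ON dbo.CustomerView
INSTEAD OF INSERT
AS
BEGIN
    SET NOCOUNT ON;
    -- Split the inserted rows across the two underlying tables
    INSERT INTO dbo.Customer (CustomerID, Surname)
    SELECT CustomerID, Surname FROM inserted;

    INSERT INTO dbo.CustomerAddress (CustomerID, PostCode)
    SELECT CustomerID, PostCode FROM inserted;
END;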
Security – Authentication & Authorization
From day 1, define the necessary security and create:
- User accounts the application will require. Use the principle of least privilege to assign permissions to these logins.
- A maintenance account for executing jobs that manage the SQL environment e.g. backups, maintenance plans etc.
- An administrator account for managing the server and applying updates.
You can choose to tie your application security and SQL security together via Windows Authentication, or develop separate tables for logins & passwords.
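A sketch of a least-privilege application account (login, database and password are placeholders):

CREATE LOGIN AppUser WITH PASSWORD = 'ChangeMe!2015';
GO
USE SalesDB;
GO
CREATE USER AppUser FOR LOGIN AppUser;
-- Grant execute on stored procedures rather than direct access to the tables
GRANT EXECUTE ON SCHEMA::dbo TO AppUser;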
Testing
The world of testing has multiple disciplines nowadays. From a database developer’s perspective there are a number of specific areas of interest.
Unit Testing
Unit Testing is a method of functional testing that can be applied to small
modules of code (units). SQL Stored procedures and functions lend themselves
well to Unit Testing as they perform actions and provide outputs according to
supplied parameters i.e. their inputs.
Detailed database testing can also be done to verify all aspects of database programming, data validity and referential integrity. I would personally be doing this as I go along, but the technique is perfectly valid for checking the work of others. http://en.wikipedia.org/wiki/Database_testing
Several ‘Non Functional’ testing disciplines are essential
too.
Performance Testing
These are conducted on designated hardware and can be
subdivided into
- Configuration Testing – How making configuration changes on known hardware affects database performance.
- Load Testing – How the system behaves under a defined load
- Stress Testing – Increasing the load to determine the limits of a system.
- Soak Testing – Endurance testing to ensure a system can cope with a continuous load.
- Spike Testing – How the system performs when suddenly under a greatly increased load, i.e. number of concurrent users, number of active transactions.
- Volume Testing – How the size of a database can influence all the above tests.
Scalability Testing adds to the above list by measuring how
well a system can –
- Scale up (improve hardware by adding RAM, CPUs or disks)
- Scale out (add extra servers)
Both approaches could enable more users and faster response
times if the application/database code also scales.
Applied SQL Testing
Database performance testing does not need to become complicated, and it can be adopted early and repeated as development progresses.
Data Generation
Generating a volume of test data is relatively easy, especially as test data sets and even random data generators are freely available on the web. Random transactional data can be generated ad hoc. If stored procedures have been used by the application to enter data, calling them again from a T-SQL WHILE loop is an easy task and will build a valid data set far faster than the application is able to.
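A sketch of that loop, reusing the hypothetical insert procedure from earlier:

DECLARE @i INT = 1, @Surname VARCHAR(50);
WHILE @i <= 100000
BEGIN
    SET @Surname = 'TestSurname' + CAST(@i AS VARCHAR(10));
    EXEC dbo.Customer_Insert @Surname = @Surname, @CountryCode = 'GB';
    SET @i += 1;
END;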
Concurrency
Testing concurrency means reproducing a realistic load. In reality, users will be performing different actions at the same time: User A will be entering records, User B will be searching, User C will be amending data, etc.
Testing an application can be automated with expensive tools,
or could be mimicked in SQL. To do this, we need to know patterns of usage i.e.
the order in which stored procedures are called for a given application action.
These can be reproduced in test scripts and populated with random data at run
time. Timings can also be recorded in a table with little overhead.
Dynamically generating SQL agent jobs is a technique we can
draw on to generate multiple requests. We can randomise the number of
simultaneous users, the actions they are performing and the timings between
requests. This will not reproduce client or network issues as results would not
leave the server. It provides a solid benchmark for the capability of the SQL backend.
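A hedged sketch of generating one such request with an Agent job (the job name and workload procedure are invented):

DECLARE @JobName SYSNAME = N'LoadTest_User_' + CAST(NEWID() AS NVARCHAR(36));

EXEC msdb.dbo.sp_add_job       @job_name = @JobName;
EXEC msdb.dbo.sp_add_jobstep   @job_name = @JobName,
                               @step_name = N'Run workload',
                               @subsystem = N'TSQL',
                               @command   = N'EXEC SalesDB.dbo.Workload_EnterOrders;';
EXEC msdb.dbo.sp_add_jobserver @job_name = @JobName;   -- defaults to the local server
EXEC msdb.dbo.sp_start_job     @job_name = @JobName;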
Data Changes
Knowing what data has changed, when, how and by whom fulfils
auditing requirements. It can also be used to intelligently load a data
warehouse with incremental changes. To achieve this I advocate ensuring
development efforts are compatible with a form of change tracking right from
the start.
Change Tracking (SQL 2008+)
Change Tracking provides synchronous tracking of data changes. Once enabled on a table, you need to query the functions that return the changes within the configured retention period (2 days by default), or they will be lost.
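A sketch of enabling Change Tracking and reading changes (database, table and the stored version are placeholders):

ALTER DATABASE SalesDB
SET CHANGE_TRACKING = ON
(CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON);

ALTER TABLE dbo.Customer
ENABLE CHANGE_TRACKING;

DECLARE @last_sync BIGINT = 0;   -- the version recorded after the previous load
SELECT ct.CustomerID, ct.SYS_CHANGE_OPERATION
FROM   CHANGETABLE(CHANGES dbo.Customer, @last_sync) AS ct;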
Change Data Capture (SQL 2008+)
This requires Enterprise Edition of SQL Server and utilises the transaction log and SQL Server Agent to provide an asynchronous (after the transaction) record of changes.
Roll your own!
A variety of methods can tell if a row has changed. Using
TRIGGERs is the simplest, but deploying a custom auditing solution could also
be achieved using either ROWVERSION columns or column CHECKSUMs.
Documentation
Document your schema, code, standards and changes from day
one.
Tools exist to assist with documentation, but SQL Server’s own ‘Extended Properties’ are a highly flexible way to achieve this too. The end result is a self-documenting database.
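A sketch of documenting a column with an extended property (the property value and object names are illustrative):

EXEC sys.sp_addextendedproperty
     @name       = N'MS_Description',
     @value      = N'Two character ISO country code of the customer.',
     @level0type = N'SCHEMA', @level0name = N'dbo',
     @level1type = N'TABLE',  @level1name = N'Customer',
     @level2type = N'COLUMN', @level2name = N'CountryCode';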
Application Design Considerations
Here are the things I wish applications would do.
Keep application code light
Push all data manipulation back to the database server.
Perform connectivity handling
By this I mean handle timeouts by allowing retries,
customising retry intervals and the number of retry attempts performed. During
this process provide the user with a display e.g. a countdown to the next
retry.
Be a true multi-user application
Implement checking to see if a record has been changed by another user since it was loaded. It may be a rare event for two users to change the same client record, but I would advocate adding checking to ensure data is not lost in this way.
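One common way to implement this check, sketched with a hypothetical ROWVERSION column and update procedure:

CREATE PROCEDURE dbo.Customer_Update
    @CustomerID   INT,
    @Surname      VARCHAR(50),
    @RowVerAtLoad BINARY(8)        -- ROWVERSION value read when the record was loaded
AS
BEGIN
    SET NOCOUNT ON;
    UPDATE dbo.Customer
    SET    Surname = @Surname
    WHERE  CustomerID = @CustomerID
      AND  RowVer     = @RowVerAtLoad;

    -- No rows updated means another user changed (or deleted) the record
    IF @@ROWCOUNT = 0
        RAISERROR('Record was changed by another user since it was loaded.', 16, 1);
END;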
Centralise Configuration Information
Hold configuration information in SQL tables. Ideally, an application configuration file would be a single XML file containing a server, database and user names as well as an (encrypted) password. All other configuration items would be fetched from database tables (in the Config schema!). Login would be permitted by tables in the AccessControl schema, and tables in the Config schema would then be queried again for user specific settings.