Friday, January 17, 2014

In The Data Trenches, Part 2 – Normalization

Spreadsheets store information in what is known as a “flat file” format – a chart X columns wide and Y rows long.  In contrast, relational databases store information in “tables,” which are connected via “links” or “relationships.”  These relationships allows smaller tables to be combined to provide all the information found in a larger file, but in a format that is easier to administer.
Consider this list of fields in a spreadsheet used as an address book:

Company Name
Address1
Address2
City
State
Zip
Contact Person
Contact Title
Contact Phone
Contact Cell
Contact eMail
 
This file is simple to build and simple to use.  It can be quickly tied to a mailing program to automate correspondence with the contact person.  It is also relatively easy to change if the contact has a change of title or phone number, or is replaced as the principal contact.

But this file does not provide an easy way to track a contact to any new assignment or employer.  One might create a “Contacts” tab in a workbook, and keep individual information there.  But that quickly creates the need to manually cross reference between the “Client Company” tab and the “Contacts” tab, which is inefficient and opens the door to errors.  A relational database can perform such cross references automatically, by tying two different sets of data together.  In this case, the two data sets might be laid out as follows:
Company Name                                       Contact Name
Address1                                                 Company Name
Address2                                                 Title
City                                                         Work Phone
State                                                       Mobile Phone
Zip                                                          eMail

The cross reference is the duplicate field “Company Name.”  Information entered into the first table automatically becomes available in the second, typically as a drop-down selection.  That is one of the great advantages of a relational database – when well designed, it requires information to be entered only once, but permits it to be used in many ways.  But this strength rests on one of greatest complications of relational databases – the need to “normalize” the tables or data sets.

“Normalization” is the process of arranging tables and fields to avoid duplication.  In practice, it is the process of breaking a large data set into component parts, and tying those parts together with relevant links or relationships.

Consider some relevant questions about a client:

·         What is their contact information?

·         How many projects are we handling for them?

·         Who is the contact for each project?

·         What is the budget for each project?

·         What are expenditures to date for each project?

·         What is the current status of each project?

·         What are the next steps for each project?

·         What obstacles have been encountered?

Creating one flat file to capture these disparate types of information would tax even a spreadsheet expert.  A relational database, however, permits each of these questions to be addressed in turn, using tables for Client, Contacts, Projects, Expenses and Status.

CLIENTS                               CONTACTS

Company Name                     Contact Name
Address1                               Company Name
Address 2                              Title
City                                       Phone
State                                     Mobile
Zip                                        eMail
PROJECTS                             EXPENSES                      

Project                                   Project                            
Company Name                      Expense Amt                   
Start Date                              Expense Type                   
Est Completion Date                Date                               
Contact Name                         Notes                               
Budget                                                                          
Notes

STATUS

Project
Current Status
Status Date
Next Steps
Waiting For
Notes

The tables Clients, Projects, Expenses and Status are in order.  Information flows from Clients to Projects via the Company Name field, while Projects, Expenses and Status are all tied using the Project field.  The information for Company Name and Project need be entered only once, suggesting that this portion of the database has been normalized.

The Contacts table, however, may be problematic.  Note that it shares information with both Clients and Projects.  While these relationships do not conflict, they will require that care be taken to ensure that the correct data sources are used when reports are generated or queries run.  A better approach might be to delete the Contact Name field from Projects, and create a separate table consisting of only Contact Name and Project. 

We will explore this question in our next installment, on table design.

No comments:

Post a Comment