Thursday, January 30, 2014

In The Data Trenches - Part 3 - Table Design


A database table resembles a spreadsheet – data arranged in rows and columns.  However, a spreadsheet is intended to process integers, to sort numbers and perform complex calculations.  In contrast, a database program is designed to organize more disparate types of information, such as dates and text.  The data types available in MS Access 2013 are:

Short Text

Number

Currency

Auto Number

Date and Time

Yes/No

OLE Object

Lookup and Relationship

Rich Text

Long Text

Attachment

Hyperlink

Calculated Field

(As previously mentioned, Access is used as our exemplar because of its wide availability.)

Each field in a table is assigned one of these types, and all information entered into that field must of that type.  For example, entering a hyperlink into a Currency field would produce an error message.

We can apply these fields to the “Status” table discussed in our earlier posting on normalization:

Field Name                                         Field Type

Project                                              Lookup and Relationship

Current Status                                 Long Text

Status Date                                      Date and Time

Next Steps                                       Long Text

Waiting For                                      Long Text

Notes                                                Long Text

The Long Text fields would present fill-in-the blank cells, while the Date field presents a calendar dialog box.  Lookup and Relationship allows the database designer to create either a drop down list of standard terms to be used in this field (“Lookup”) or to tie this table to another one in the database (“Relationship”).  In our example, this table would be tied (“related”) to the main “Projects” table, and the names of all projects would appear here as a drop down list.  The user would merely need to click “Project 1,” “Project 101,” or any other name on the list.  A look up list helps ensure accurate and uniform data entry.  If “Agmt” is the specified description, users must use it, and may not substitute their own terms, such as “contract” or “agreement.”  While these may be accurate, they complicate attempts to search the database. 

While it is helpful to describe each data type consistently and uniformly, each project needs a unique name or label.   Presume the database is intended to organize a company’s contract files.  Individual documents might be described as “agmt,” “corr” or “draft” without causing confusion.  But simply labeling each supply agreement as “supply agmt” could result in confusion.  Which agreement corresponds to which vendor?  A descriptive, and unique name is called for, such as “supply agreement – ABC Co – 2014”. 

 

Which brings us to a fault in the table described above.  It does not ensure that each entry will contain at least one unique entry.  The standard solution is to designate a “key” field, which must be unique, meaning that no two records in the entire database will be identical.  To ensure that the “key” is truly unique, one might specify that it will be an “Auto Number” field, directing the database program to automatically number each record as it is created.  The updated table design might appear as follows:

 

Field Name                                         Field Type

ID                                                         AutoNumber

Project                                                Lookup and Relationship

Current Status                                   Long Text

Status Date                                        Date and Time

Next Steps                                         Long Text

Waiting For                                         Long Text

Notes                                                    Long Text
Once the design is finalized – and the requisite relationships created - one may begin to enter information directly into the table, much as one would with a spreadsheet.  Or one may elect to create an entry form to speed up the process. 

Friday, January 17, 2014

In The Data Trenches, Part 2 – Normalization

Spreadsheets store information in what is known as a “flat file” format – a chart X columns wide and Y rows long.  In contrast, relational databases store information in “tables,” which are connected via “links” or “relationships.”  These relationships allows smaller tables to be combined to provide all the information found in a larger file, but in a format that is easier to administer.
Consider this list of fields in a spreadsheet used as an address book:

Company Name
Address1
Address2
City
State
Zip
Contact Person
Contact Title
Contact Phone
Contact Cell
Contact eMail
 
This file is simple to build and simple to use.  It can be quickly tied to a mailing program to automate correspondence with the contact person.  It is also relatively easy to change if the contact has a change of title or phone number, or is replaced as the principal contact.

But this file does not provide an easy way to track a contact to any new assignment or employer.  One might create a “Contacts” tab in a workbook, and keep individual information there.  But that quickly creates the need to manually cross reference between the “Client Company” tab and the “Contacts” tab, which is inefficient and opens the door to errors.  A relational database can perform such cross references automatically, by tying two different sets of data together.  In this case, the two data sets might be laid out as follows:
Company Name                                       Contact Name
Address1                                                 Company Name
Address2                                                 Title
City                                                         Work Phone
State                                                       Mobile Phone
Zip                                                          eMail

The cross reference is the duplicate field “Company Name.”  Information entered into the first table automatically becomes available in the second, typically as a drop-down selection.  That is one of the great advantages of a relational database – when well designed, it requires information to be entered only once, but permits it to be used in many ways.  But this strength rests on one of greatest complications of relational databases – the need to “normalize” the tables or data sets.

“Normalization” is the process of arranging tables and fields to avoid duplication.  In practice, it is the process of breaking a large data set into component parts, and tying those parts together with relevant links or relationships.

Consider some relevant questions about a client:

·         What is their contact information?

·         How many projects are we handling for them?

·         Who is the contact for each project?

·         What is the budget for each project?

·         What are expenditures to date for each project?

·         What is the current status of each project?

·         What are the next steps for each project?

·         What obstacles have been encountered?

Creating one flat file to capture these disparate types of information would tax even a spreadsheet expert.  A relational database, however, permits each of these questions to be addressed in turn, using tables for Client, Contacts, Projects, Expenses and Status.

CLIENTS                               CONTACTS

Company Name                     Contact Name
Address1                               Company Name
Address 2                              Title
City                                       Phone
State                                     Mobile
Zip                                        eMail
PROJECTS                             EXPENSES                      

Project                                   Project                            
Company Name                      Expense Amt                   
Start Date                              Expense Type                   
Est Completion Date                Date                               
Contact Name                         Notes                               
Budget                                                                          
Notes

STATUS

Project
Current Status
Status Date
Next Steps
Waiting For
Notes

The tables Clients, Projects, Expenses and Status are in order.  Information flows from Clients to Projects via the Company Name field, while Projects, Expenses and Status are all tied using the Project field.  The information for Company Name and Project need be entered only once, suggesting that this portion of the database has been normalized.

The Contacts table, however, may be problematic.  Note that it shares information with both Clients and Projects.  While these relationships do not conflict, they will require that care be taken to ensure that the correct data sources are used when reports are generated or queries run.  A better approach might be to delete the Contact Name field from Projects, and create a separate table consisting of only Contact Name and Project. 

We will explore this question in our next installment, on table design.

Thursday, January 9, 2014

In The Data Trenches, Part One: The Limitations of Spreadsheets

The spreadsheet is central tool of business.  It allows users to easily, rapidly and accurately perform complex analyzes and calculations, providing insights vital to commercial success.  It is also a popular tool for inventories, contact lists, and even complex catalogs of witnesses and testimony or contracts and their terms.  As these catalogs grow, they become unwieldy, particularly if one wishes to cross reference between sets of data.  Spreadsheets are not designed to address such “relationships;” but relational databases are.  This series will discuss the fundamentals of databases and build a simple project tracking tool to illustrate the many benefits these programs can offer.

While there are a number of relational database programs on the market, Microsoft Access is probably the most widely known.  These articles will therefore use Access as their exemplar, but that is not intended as an endorsement of Access over any of its competitors.  Similarly, spreadsheet examples will be based on Microsoft Excel.
A correspondence file is a simple example of a spreadsheet used for record keeping might include the following fields:
  • Date
  • To
  • From
  • Regarding
  • Summary
  • Next Steps
 Technically known as a “flat file,” this file is simple to build and simple to use.  The column headings can be added in moments, and the list can be continued virtually forever, with new rows of entries added as needed.  But that simplicity belies the limitations inherent in this tool. 

A sample entry might be:

  • Date:              1/1/14
  • To:                  Ms. S. Smith
  • From:             Atty. Jones
  • Re:                  Black Acre
  • Summary:     Notice of alleged ground water contamination
  • Next Steps:  Organize meeting with regulators 

But imagine a file containing thousands of such records, on thousands of subjects.  Then imagine scouring the file for all correspondence regarding “Black Acre.” Experience spreadsheet users will recognize that the search can be completed with a simple “Find” (or “Sort") command; the less experienced might well become lost.  In addition, data needs to be keyed into each and every cell (even if one relies upon the "auto complete" features found in current spreadsheet programs).  “Ms. Smith” and “Black Acre” need to typed each time an item of correspondence is recorded.  Each  entry introduces opportunity for error.  “Black Acre” might become “Blak Arce,” “Blac Acre,” “Blk Acre,” or any other combination that tangled fingers might produce.  An electronic search for “Black Acre” will not produce records logged under “Blk Acre.” 

But there are other weaknesses:

  • This file does not offer any contact information.  One must look elsewhere for Ms. Smith’s address. 
  • Shipping fees are evidently captured elsewhere.
  • Each “Next Step” must be added manually to yet another separate record or tracking tool. 
To fully document that January 14 letter to Ms. Smith regarding Black Acre, one must consult this file, the firm’s address book, the billing system and the document management system, if one wants to see the letter itself.  Four searches for one document is hardly efficient.  More, those searches would not touch the underlying document - the notice of contamination mentioned in the "Summmary" field.

A relational database would provide a more robust tool, although it would require more set up.  It would contain separate files (to use the correct term, "tables") for contact information, projects, expenses, next steps and correspondence, tied together through “relationships,” and organized in such a way that each item of information only need be entered once.  After that, each item may be used again and again via drop-down menus.

Organizing these tables to ensure one time data entry is one of the keys to a successful database.  We will take up that question in our next installment.