SQL Server - Survey Data Model - How to avoid EAV and excessive denormalization?

Asked By Alex on 06-Jan-11 10:13 AM
Hi everyone,

My database skills are mediocre at best and I have to design a data model for survey data. I have spent some thoughts on this and right now I feel that I am stuck between some kind of EAV model (http://en.wikipedia.org/wiki/Entity-attribute-value_model) and a design involving hundreds of tables, each with hundreds of columns (and thousands of records). There must be a better way to do this and I hope that the wise folks on this forum can help me.

I have already searched various forums, but I couldn't really find a solution. If it has already been given elsewhere, please excuse me and provide me with a link so I can read it up. See for example this thread for a typical example of what I found so far: http://stackoverflow.com/questions/1764435/database-design-for-a-survey/1764901#1764901

Some assumptions about the data I have to deal with:
  1. Each survey consists of 1 to n questionnaires
  2. Each questionnaire consists of 100-2,000 questions (please ignore that 2,000 questions really sound like a lot to answer...)
  3. Questions can be of various types: multiple-choice, free text, a number (like age, income, percentages, ...)
  4. Each survey involves 10-200 countries (These are not the respondents. The respondents are actually people in the countries.)
  5. Depending on the type of questionnaire, each questionnaire is answered by 100-20,000 respondents per country.
  6. A country can adapt the questionnaires for a survey, i.e. add, remove or edit questions
  7. The data for one country is gathered in a separate database in that country. There is no possibility for online integration from the start.
  8. The data for all countries has to be integrated later. This means for example, if a country has deleted a question, that data must somehow be derived from what they sent in order to achieve a uniform design across all countries
  9. I will have to write the integration and cleaning software, which will need to work with every country's data
  10. In the end the data needs to be exported to flat files, one rectangular grid per country and questionnaire.
I have already discussed this topic with people from various backgrounds and have not come to a good solution yet. I mainly got two kinds of opinions.
  1. The domain experts, who are used to working with flat files (spreadsheet-style) for data processing and analysis vote for a denormalized structure with loads of tables and columns as I described above (1 table per country and questionnaire). This sounds terrible to me, because I learned that wide tables are to be avoided, it will be annoying to determine which columns are actually in a table when working with it, the database will become cluttered with hundreds of tables (or I even need to set up multiple databases, each with a similar yet a bit differetn design), etc.
  2. O-O-programmers vote for a strongly "normalized" design, which would effectively lead to a central table containing all the answers from all respondents to all questions. This table would either need to contain a column of type sql_variant type or multiple answer columns with different types to store answers of different types (multiple choice, free text, ..). The former would essentially be a EAV model (see for example this thread http://www.eggheadcafe.com/software/aspnet/32645959/generic-datatype-table.aspx). I tend to follow Celko here, who strongly discourages its use. The latter would imply that each row would contain null cells for the not applicable types by design.
Another alternative I could think of would be to create one table per answer type, i.e., one for multiple-choice questions, one for free text questions, etc.. That's not so generic, it would lead to a lot of union joins, I think and I would have to add a table if a new answer type is invented.

Sorry for boring you with all this text and thank you for your input!

Cheers,
Alex

I have successfully built such a system - Robbe Morris replied to Alex on 06-Jan-11 10:31 AM

but won't give you all the details because it would be giving away to much of our intellectual property.  I will say this though, the correct answer is a normalized and centralized data structure.  Our system has only 5 tables that deal with question and answer data.  The rest of the tables deal with rendering the user interface.  In total, we have less than 30 tables including simple lookup/cross reference tables.

The stackover flow suggested answer with 7 votes is on track to what you need but has serious relationship flaws that affect its flexibility and data mining capabilities.  Look at its tight integration and move towards a more loosely coupled design.

That's all ya get. :)

As an fyi... I used to be a senior engineer for the IT research firm Gartner.  Made my living building sophisticated data capture and decision making software for IT executives to use.  In other words, I've seen the upfront challenges you are facing as well as the more complex challenges and evolution of data mining requirements that are coming your way.

Pat Hartman replied to Alex on 07-Jan-11 02:31 AM
I don't have a survey app that I can share but here's some stuff from an Insurance app that will show you how to do what you need to do with an EAV structure.  The FieldType defines what type of data the field (answer in your case) will hold.  The ControlType defines what type of control will be used to display the data.  And, the combo data source holds the name of a query that will be used to populate the RowSource when the ControlType is Combo.

The following is a picture of the data entry form.  It lists the fields (questions) in the specified order for this product type.  The field name (question) is displayed and the Value field is used to hold the "answer".  There are two pictures following.  The first shows what the form looks like when the cursor is in a text or memo field and the second shows what happens when the cursor is in a combo field.  See how the down arrow appears on each row.  This is an Access issue that I was never able to overcome due to the way properties are handled on a continuous or datasheet form.  Since the issue is visual only, the users easily learn to live with it.

Alex replied to Pat Hartman on 07-Jan-11 04:46 AM
Robbe & Pat,

Thanks for both of your replies.

I'm still unsure as to what exactly my Answers table(s) should look like in the database. I tend to treating answers as entities in their own right and not properties of a respondent. This means I need one or more tables with one row per answer. The problem is that answers are of different data types, depending on the corresponding question (e.g., multiple choice, free text). I don't want to squeeze them all in a sql_variant column, because it would make integrity constraints more cumbersome (and I'm worried about performance, although I don't really know about that). So I have to decide between having one Answers table with one column per data type and one Answers table per data type. The former leads to n(answers types) -1 null values per row, the latter leads to multiple tables for the same kind of entity.

What would you do?

Thank you
Alex
Pat Hartman replied to Alex on 07-Jan-11 06:13 PM
In the application I posted pictures of, the number of rows in the "answers" table grew to over a million rows.  When it exceeded 500,000, I took a couple of hours and upsized from Jet to SQL server (which had always been the plan).  The table grows by about a million rows per year and each year, policies older than 18 months old are archived.  They are still kept in the company's main policy application of course but this app is only used for collecting the initial data and producing the policy documents via Word automation.  Archiving keeps the app from being unmanageable.  

This app grew out of a client's need for responsiveness from IT to support growing product options.  With their "flat" web app, it was taking 3 months to create the tables and build the forms to support a new product (survey).  With the design I chose, the client didn't need IT support at all to define a new product(survey).  He simply defined new fields (questions) as needed including their data type, control type, and combo RowSource query if necessary).  They then defined a new product (survey) and picked the fields (questions) to be included and specified the order in which they should be displayed.  The ultimate result was an app that could grow quickly with the business and a new product could be created within a few hours or days depending on how many Word documents (and how different they were from other docs) needed to be created.

I rely heavily on RI and not being able to implement it for the fields (answers) table was disturbing to me at first but I found that it wasn't a problem because there was only one form where the values were updated and so for the fields that needed consistent choices, I just used combos to control the selections and of course for the free text it didn't matter.  The numeric data is stored as plain numbers so that it is easy to convert to numeric in queries if arithmetic operations are required and the data type defined for the field tells me how to format the value when displaying it in a form or using it to populate form fields in Word documents.
Alex replied to Pat Hartman on 09-Jan-11 08:34 AM
Thanks, Pat,

I appreciate your input a lot. Your solution sounds very much like what I have had in mind until I read (parts of) Thinking in Sets by Celko. He strongly argues against storing meta data (like data type of another field) as data. Your (our) approach relies on what he calls One True Lookup Table (OTLT). I find it interesting to hear that your experience with it seems to be positive. (And I imagine that Robbe's system looks similar in terms of how answers are stored.)

But I would like to hear the other side, too. I'm sure there are SQL experts out there, who would hate such a data model. I'm not asking for a complete solution here, just ideas and - most importantly - the rationale behind the ideas.
Pat Hartman replied to Alex on 09-Jan-11 05:41 PM
The choice of this ultra normalized schema was made due to the need for speed in developing new products.  In a more mature company where new products weren't being developed every few months, I would have defined a flat table for common fields and separate tables by product to hold the product fields.  Using this vertical table allowed the user to create new products as the need arose.  Which is the same reason that EAV is the preferred schema for surveys, questionnaires, tests, etc.  Working with entities of this nature is quite simple until you get to the possibility of different answer data types.  No one questions the EAV solution as long as the answers are all of the same data type.  Once the answer data types are different, the questions start.  You could create a separate table for each answer datatype which should satisfy the purists but it will complicate your SQL and be just as awkward for forms and reports.  Having a field for each possible data type simplifies the SQL but does nothing to simplify forms and reports.  No matter which vertical solution you choose (data type field + common value field (my suggestion), separate value field for each data type, separate table for each data type, you end up with awkward processing for the answers.