Monday, March 21, 2011

Entity Relation Diagram

Entity-Relationship Dictionary
An entity-relationship model is properly made up of two components: the E-R diagram, and an E-R dictionary.
An E-R dictionary merely lists the entities, attributes, identifiers, and relationships in text form, providing additional explanation of design decisions and details of the problem domain as needed.
Attribute domains should be defined in the E-R dictionary, since there's no place for them on the diagram.
Your goal is to let the diagram do most of the talking. Only document those elements of your design that aren't intuitive or obvious in the E-R dictionary. Any more than that is merely duplicated effort (for both the writer and the reader).
We'll skip E-R dictionaries for our in-class exercises, but one will be required for the Conceptual Model deliverable of the term project. This year's and last year's sample projects will demonstrate what an E-R dictionary looks like.
Attributes
Attributes come in four flavours: single value, multivalue, composite, and derived/calculated.
Single value attributes are the most common variety and contain, as advertised, a single piece of information. Examples: name, date_of_birth, social_insurance_number, student_number, etc.
Multivalue attributes contain multiple simultaneous values. For instance, a CLIENT entity might have a contact_person multivalued attribute that contains more than one contact person's name for a single CLIENT instance. Multivalue attributes are intended to be a shorthand for a one-to-many relationship to another entity, but just end up being confusing. Always model multivalue attributes as one-to-many relationships to external entities.
Composite attributes are merely shorthand notation for groups of related attributes. The address composite attribute is the most common example: just a placeholder for the attributes street_address, city, region, country, postal_code, and so forth. A telephone number might be another composite attribute if your system will later need to keep track of the area code, exchange, line number, and perhaps extension separately. Using composite attributes to simplify models is perfectly acceptible, provided the meaning is clear to anyone trying to understand the model. If there's any potential confusion, document the details in the entity-relationship dictionary.
Attributes, like relationships, may also be optional or mandatory. However, if you find that an entity contains many optional attributes, a redesign might be in order. The diagram to the left below of the PERSON entity shows two optional attributes: MD_certification_date and boat_name. Neither of these are applicable to all persons, and so they must be optional (attributes not marked optional are considered to be mandatory although the distinction may be left for the E-R dictionary). In addition, neither of these attributes is particularly descriptive of the majority of people (and boat_name is not usually something we associate with a person).


The diagram to the right shows the same model redesigned to accommodate the optional attributes. In fact, now there aren't any optional attributes. Instead, we've identified two subtypes of PERSON, DOCTOR and BOAT_OWNER. All doctors must have an MD certification date, so that attribute is now mandatory in the DOCTOR entity. While we could put the attribute boat_name into the BOAT_OWNER entity, that would still seem odd, and doesn't take into account the possibility that a boat owner could own more than one boat. And so, the BOAT entity is created with the boat_name attribute. Note that we don't want to have a subtype relationship directly from BOAT to PERSON, because that would fail the "is a" test: a boat is not a person. The new design doesn't feature any optional attributes or relationships, and that's one of our goals (often unattainable) in database design.
Relationships
Relationship cardinality is often written as "1:1" (one-to-one), "1:M" (one-to-many), and "M:N" (many-to-many).
Many-to-many relationships are written as M:N rather than M:M or N:N to indicate that the numbers involved may have nothing in common (e.g. the many-to-many relationship between COURSE and STUDENT the other day could mean that a student takes 4 courses and a course is taken by 15 students--there's no reason those two numbers need be the same).
Our goal in E-R modelling is to minimize cardinality and maximize participation.
To minimize cardinality, we look for ways to convert M:N relationships into 1:M relationships without breaking the requirements (this may be possible by using multiple 1:M relationships as we did in last week's lab between DOG and DOG_BREED). The same idea can sometimes be used to convert 1:M relationships into 1:1 relationships, but that's less typical.
By "maximizing participation", we're really looking to remove as many occurrances of optional participation as we can. The requirements often restrict us from doing this, but it turns out the fewer optional cases we have to deal with, the better.
One-to-One Relationships
One-to-one relationships are the least common of the three cardinality possibilities. Nevertheless, there are three possibilities whenever we're faced with a one-to-one relationship, as illustrated in the following three diagrams.


In the first diagram, INSTALLER (perhaps a cable or telephone line installer) is a subtype of EMPLOYEE. In other words, an INSTALLER "is an" EMPLOYEE. Subtype relationships are simply 1:1 relationships when you look under the hood. After all, an installer can't be more than one employee, and an employee can't be more than one installer (and may not be an installer at all).
In the second diagram, the model indicates that an installer is assigned a single truck, and a truck is assigned to a single installer or none at all (there may be more trucks than installers). On the surface, this 1:1 relationship with optionality on one side resembles a subtype relationship. However, it fails the "is a" test miserably as an installer is not a truck.
The last diagram shows a mandatory 1:1 relationship between the INSTALLER and CONTACT_INFO entities. This 1:1 relationship is a candidate for merging into a single entity, since neither can exist without the other, and no other entity is related to CONTACT_INFO. The attributes from CONTACT_INFO might as well be moved into INSTALLER, and the CONTACT_INFO entity dispensed with.
Relationship Attributes
Like entities, M:N relationships may have attributes (but only M:N relationships!).
In the diagram below, employees may optionally enroll in a number of benefit packages. For each such enrollment, we want to record the date this occurred, as well as the rate that the employee negotiated for the particular benefit (for whatever reason, they're allowed to do this).


The enrollment_date and negotiated_rate can't be placed in the EMPLOYEE entity, since a single employee instance may enroll in many packages (which one would be recorded in the EMPLOYEE entity?). Similarly, the attributes can't be placed in BENEFIT_PACKAGE either because we wouldn't know to which employee the date and rate values referred. The only remaining choice is to model these as attributes of the relationship between EMPLOYEE and BENEFIT_PACKAGE. That way, the date and rate of each employee's enrollment for each package can be properly recorded.
During normalization, we'll see why only M:N relationships can have attributes.
Recursive Relationships
Recursive relationships are no different from any other type of binary relationship, but they always seem a little tricky to interpret.
Recursive relationships are commonly used to model hierarchies of information, but a recursive relationship doesn't necessarily imply a hierarchy.
The first two of the following diagrams illustrate a very familiar hierarchy: a filesystem directory structure. The recursive relationship PARENT for the first diagram should be read: a folder may have many sub-folders, and a folder has at most one parent folder (the root folder in such a filesystem does not have a parent). The second diagram illustrates the interesting case for *NIX operating system filesystems: a directory is a file, and the root directory's parent directory is the root directory itself (via some slight of hand in the definition of the ".." link).


The third diagram shows that EMPLOYEE participates in two recursive relationships. The MANAGES relationship models a hierarchy of managers and the employees they manage (an employee may manage several other employees, and an employee has a manager or none at all--the president has no manager). The MARRIED_TO relationship, on the other hand, does not model a hierarchy, but instead indicates that an employee may or may not have a single spouse. The MARRIED_TO relationship is interesting because its meaning is identical in both directions.
Ternary Relationships
Most relationships connect two entities. These are called binary relationships. However, sometimes it is necessary to relate three or more entities at the same time. To do this, a ternary relationship is mandated.
However, ternary relationships are not easily modelled or drawn using Crow's Foot notation, and in any case they don't survive the normalization process (relational theory only deals with binary relationships). Therefore, whenever a ternary relationship is called for, we'll slip in an extra entity to help us model it as a set of binary relationships instead.
Consider the scenario modelled using a ternary relationship in the following diagram. A person may take courses in either the role of student or instructor (ROLE is a reference entity with the instances student and instructor). A person can't enroll in a course if we don't know whether they're an instructor or student, and so we need all three entities to participate simultaneously.


The first problem we encounter is how to draw the cardinality and participation symbols on the ternary relationship, especially if we add the requirement that a course must have at least one enrolled student and exactly one enrolled instructor. Where do we draw those symbols? It turns out that Crow's Foot notation doesn't handle this situation adequately.
In any case, we want to redesign the model to only use binary relationships. To do this, we invent a new entity that represents the simultaneous relationship between COURSE, PERSON, and ROLE. In the example below, we'll call this entity ENROLLMENT.


An instance of the ENROLLMENT entity represents a single person enrolled in a single course filling a single role. Because it's an abstract concept to begin with, and also because its relationships with the other entities, ENROLLMENT is very much a weak entity.
Clever designers might see another approach that obviates the need for any ternary relationship:


This design even adds the extra restriction that a course must have a single instructor and at least one student which was missing from the ternary model. The weakness of this design is that it can't accommodate any additional roles (e.g. auditing student, guest lecturer) without a redesign. But if it's known that there never can be any role other than instructor or student, then it works very well.
Now let's expand the requirements to include recording a grade for each student in a course. In the remodelled ternary scenario, this merely consists of adding an optional grade attribute (optional, because instructors do not recieve marks, and also because a mark won't be assigned until the end of the term: optional in this case means both "does not apply", and "unknown" for the two circumstances, respectively).


The non-ternary model also accommodates the grade, in this case as an attribute of the M:N STUDENT relationship. The optionality of the grade attribute now only has the meaning "unknown" as it applies only to students. The grade will be unknown until the end of the term.


Finally, we'll complicate things one more time by introducing a reference entity, LETTER_GRADE. This entity's instances are a list of all of the possible letter grades (e.g. A+, A, A-), as well as the corresponding percentage ranges and GPAs for each. In the remodelled ternary scenario, a 1:M relationship relates ENROLLMENT and LETTER_GRADE. The relationship is optional on the side closest to LETTER_GRADE indicating that an enrollment might be for an instructor (grade does not apply) or for a student before the end of the term (grade is unknown).


It's at this point that the non-ternary model has to break down and introduce its own ternary relationship between COURSE, PERSON, and LETTER_GRADE. Except in this case, the remodelled ternary relationship only represents the student role. In this case, the optional relationship from STUDENT_ENROLLMENT (our "invented" entity) and LETTER_GRADE is also optional, meaning unknown until the end of term (instructors don't take part in STUDENT_ENROLLMENT so we don't have to worry about the other meaning).