Databases – A Research Paper by Steve Edwards 2009

Introduction

As there are many ways in which electronic data can be stored, accessed, altered, and defined, so there are many different types of Database in existence, each storing and manipulating data for a particular purpose. This paper investigates the differences between possible configurations of electronic Multi-User Databases – where one or more individuals can access stored data "seemingly" simultaneously – in particular the Centralised, Client/Server, and Distributed Database models, and some of their characteristics, advantages, and disadvantages.

But what is an electronic Database and why is its creation of use?

The online Oxford English Dictionary defines a Database thus:

Database Definition: "database • noun a structured set of data held in a computer."

Compact Oxford English Dictionary, https://www.askoxford.com/concise_oed/database?view=uk, date accessed 6/11/2009

The key words here are "structured set": it is this characteristic – data that functions as a structure within a Database model – that differentiates it from electronic data that is simply stored collectively as individual, possibly unrelated files on particular media such as hard drives, tape devices, etc.

A more holistic, non-technical description of the Database and its purpose is given below. It again suggests that data structuring and inter-relationships between multiple collections of data are innate and purposeful, arranged to be analogous to something in reality and to give raw data some meaning so that it can be of practical use:

"…to model the collective user perceptions of the real world. The database reflects an image of the real world; an image that governs the behaviour of people and application processes in the organization."

Gordon C. Everest (1986) Database Management, Singapore: McGraw-Hill Book Co.

To understand more clearly why structured data is desirable, a distinction between data and information is required, as provided by Terence Hanold in 1972 in "An Executive View of MIS" and quoted by Everest:

Information Definition: "Data is transformed into information through the infusion of purposeful intelligence. Thus, information is data refined by intelligence so that it communicates meaning or knowledge…"

Gordon C. Everest (1986) Database Management, Singapore: McGraw-Hill Book Co.

So there is now a reason to structure data: so that it can be manipulated with purpose, become information, and thus be useful. Hopefully this gives the layperson a clearer foundation from which to continue.

The precise origins and first use of the term "database" are unclear. One possibility, referenced from W. C. McGee's study of IBM computers in the historical development of Databases, is that around 1964 the term derived from joining the words "data" and "base", reflecting the notion of an original source or "Master" file acting as the "base" to which data from other sources could be added.

"Around 1964 a new term appeared in the computer

literature to denote a new concept. The term was "data

base," and it was coined by workers in military information

systems to denote collections of data shared by end users

of time-sharing computer systems"… "As a result, common data items

often appeared in different master files, and the values of

such items often did not agree. There was thus a requirement

to consolidate the various master files into a single

data base which could be centrally maintained"

WC McGee (1981) Data Base Technology, PDF https://domino.watson.ibm.com/tchjr/journalindex.nsf/c469af92ea9eceac85256bd50048567c/18c2b2dadee8a44985256bfa0067f4d8?OpenDocument p1,p505 IBM I. RES. DEVELOP. VOL. 25 0 NO. SEFTEMBER 1981

1.     Discuss the differences between centralised, client server and distributed database configuration.  Explain where it would be appropriate to use these configurations.

To get a foundation from which to proceed, externally referenced definitions for each configuration will be used.

Centralised Database Configuration:

To Centralize Definition: "centralize • verb concentrate (control or power) under a single authority"

Compact Oxford English Dictionary, https://www.askoxford.com/concise_oed/database?view=uk, date accessed 9/11/2009

Distributed Database Configuration:

Distributed Definition: "A logically interrelated collection of shared data (and a description of this data) physically distributed over a computer network."

Thomas Connolly, Carolyn Begg, Anne Strachan (1995) Database Systems 2nd ED: Pearson Education

Client/Server Configuration*:

Client/Server Definition: "In the client/server system, the main computer, called a server, is accessible from a network – typically a local area network (LAN) or a wide area network (WAN)"

Andrew S. Tanenbaum (2003) Computer Networks 4th ED, Upper Saddle River, NJ: Pearson Education

*Note that this is NOT a definition of a Database type but of a data-transfer process that is a fundamental part of most networked Distributed AND Centralised models, enabling a user (the Client) to access the data held on the Database Server.

Centralised and Distributed Databases

We can infer from the earlier quote from McGee that centralisation was an original goal in early Database design, for administration and data-protection purposes. Networked systems, certainly in the sense that we know them today, did not exist in the 1960s, and neither did the relational theory of data, which E. F. Codd, a mathematician and employee of IBM, invented around 1970. This theory was key to the development of the Relational Database model, from which the term stems, and later to the Distributed Database model over networked systems, as it then became possible to view data structures in ways other than the original hierarchical, serial, "flat-file" oriented way – the network, semantic, and relational views being the most common.

Until Codd's groundbreaking work, a single centralised database configuration was the only model in existence and under development. It still exists today, as it retains certain advantages over the distributed database configuration.

As this centralised configuration resides in only one physical location, on one machine, under the control of one Database Management System (DBMS), an immediate concern over data preservation is apparent: data may be lost through "local" damage such as fire or flood, as there is no inherent means of total data duplication except locally. Any local backup procedure then relies completely on the removal and safekeeping of local copies at remote locations, and on keeping acceptable synchronisation timescales between the latest versions and the backup copies.

However, there are certain situations where Database centralisation may be a requirement or a necessity – for example, to prevent unauthorised access to sensitive information within a secure building environment, such as a military establishment, where the physical removal of information and local data access can be closely controlled and monitored. External network access may even be prevented completely, to reduce the potential for data interception on connections outside the building complex.

For any Database there is an issue of time-related data accuracy: essentially, "how up to date" or "current" the most recent Database image is in real time.

Another advantage of the centralised model is that data accuracy is upheld in a relatively timely fashion compared with the Distributed, networked Database model – all things being equal – where multiple instances of the same Database may be running on different systems simultaneously. In the centralised model, "concurrent" access is processed locally, on one computer running one instance of the Database, and each "transaction" (a completed user access that changes a data value in the database) is processed and completed before the next concurrent user transaction can occur. The database therefore always presents its most recent state of change between transactions, so a high degree of data accuracy is retained, provided no transaction errors have occurred that would adversely affect data "consistency". The more processing power available the better, naturally, so many centralised databases were originally run on Mainframe computers (and still are) to help speed up multi-user transaction times, with users connected to the system by dumb terminals. Today, user access to centralised and distributed database systems – controlled by local Database Management System (DBMS) software – is from networked computer clients running client software in a Client/Server configuration.

There may be less data accuracy between user transactions with a distributed configuration, as there are multiple instances of the database running simultaneously on different systems. Each may require "consolidation" to a Master database instance at some point, which may not yet have occurred for a number of reasons, such as system inter-connection delay (latency), workload, or differences in processing time between the networked systems.

Data accuracy in the centralised system may come at the price of perceived user delay, as there is only one instance of a given portion of data that can be accessed and changed at a time. Not so with the distributed networked model, because multiple copies of the Database exist; conversely, the database "image" being accessed there may not be the most up-to-date version that exists across those multiple instances.

Factors such as network connection bandwidth, system processing power, and user workload all affect both database models to a greater or lesser extent, so they have to be taken into consideration at the design stage, with compromises made depending on the intended role of the Database. Methods have been invented and implemented to improve this simultaneous-access/concurrency delay while retaining acceptable levels of data accuracy.

The Distributed Database model is inherently less secure from an unauthorised-access standpoint than the Centralised model, mainly because the data is usually stored across multiple networks, multiplying the potential for unauthorised access to multiple systems. However, it has the major advantage of minimising the risk of complete data loss, as the data can be copied from and to, and stored on, multiple computers automatically, which may be geographically dispersed worldwide. It also reduces user transaction delay, as there are many instances of the same database serving many clients on multiple systems. Failover redundancy is inherent: if one system fails, another is available in its place.

Note from all the previous comparisons that these separate configurations can be combined, as here in a study by Mehmet Savsar and Fawaz S. Al-Anzi evaluating the "Reliability of Data Allocation on a Centralized Service Configuration with Distributed Servers":

"The centralized server monitors and controls the distributed database servers, which can be remotely separated from each other and from the center by almost any distance…"

https://comjnl.oxfordjournals.org/cgi/content/abstract/49/3/258?maxtoshow=&HITS=10&hits=10&RESULTFORMAT=1&andorexacttitle=and&andorexacttitleabs=and&fulltext=database&andorexactfulltext=and&searchid=1&FIRSTINDEX=10&sortspec=relevance&fdate=1/1/2004&resourcetype=HWCIT

The Client/Server Configuration

This "data transfer" model applies fundamentally to the Distributed Database configuration, as the term is a general networked computer data access process definition, rather than a Database model in itself. The model is indifferent to whether the client (user) is physically "local" or "remote" or what media (network cable, wireless etc.) is utilised to connect the client to the server. This model also applies to a relatively recent situation that has evolved, thanks to large amounts of memory, storage and processing power that are available for relatively very low cost in personal computers or laptops, where the database server software and the client software can reside on the same physical machine.

Historically, within the evolution of Database systems, this combination was not possible at first, as computer hardware was extremely expensive and of specific design, and Operating System software capable of running both a server and a client simultaneously on one computer did not exist. This data-transfer model exists wherever stored data requires accessing over a network by a user, whether a Database is involved or not.

As there are now many versions of both server Database Management System ("back-end") software and client "query" ("front-end") software available for free, there are major benefits for the individual. A student, for example, can take both roles of user and administrator for different "flavours" of Database query and management tools (MySQL, MS Access, etc.), making the construction, administration, and access of their own database relatively simple and cheap, as in the sketch below.
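As a minimal sketch of this (assuming a locally installed MySQL server; the database and table names are invented purely for illustration), a student acting as both administrator and user could create and query a small database in a handful of statements:

    -- create a database and a simple table, then add and read back a row
    CREATE DATABASE college;
    USE college;

    CREATE TABLE Students (
        StudentID   INT PRIMARY KEY,      -- whole numbers only
        StudentName VARCHAR(60)           -- text up to 60 characters
    );

    INSERT INTO Students (StudentID, StudentName) VALUES (1, 'Homer Simpson');

    SELECT StudentID, StudentName FROM Students;

Here the same person issues administrative commands (CREATE DATABASE, CREATE TABLE) and ordinary user queries (INSERT, SELECT) against a server and client running on one machine.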

2.     Describe how you would tackle the problem of ensuring data consistency and concurrency.

Data Concurrency

Concurrency was touched on earlier in respect to the access of data by users during simultaneous transactions, but not defined.

Concurrency Definition "Concurrency is the ability of processes to interact with the data without encountering interference from concurrent processes. If the probability of interference is low, then the level of concurrency is said to be high. Increased concurrency translates into increased sharability and availability."

Gordon C. Everest (1986) Database Management, Singapore: McGraw-Hill Book Co.

As stated earlier, user experience is directly affected by any transaction delay; yet ultimately data accuracy is paramount, so the user experience will always be secondary to the integrity and accuracy of the data. Can any improvement be made in this respect, and if so, how is it achieved?

"Concurrent access control generally takes the form of

data locking, i.e., giving a user exclusive access to some

part of the data base for as long as necessary to avoid interference."

WC Mcgee (1981) Data Base Technology, PDF https://domino.watson.ibm.com/tchjr/journalindex.nsf/c469af92ea9eceac85256bd50048567c/18c2b2dadee8a44985256bfa0067f4d8?OpenDocument p1, p515 IBM I. RES. DEVELOP. VOL. 25 0 NO. SEFTEMBER 1981

There are levels of "granularity" that can be imposed on access to data – the degree to which a portion of data being accessed is "locked out" from further transaction attempts while the current access is occurring – and these affect the processing overhead of the system overall. For example, if the default level of lockout is the whole file (as with a network-shared Word or PDF document), even when only a small section of it is being changed, then the locking granularity is coarse, which translates to low system overhead; but it may be very inconvenient if another part of the file needs altering by someone else at the same time. That situation is unacceptable in a database environment, as a database may contain thousands or even millions of related Tables and associated records, and locking a single related Table may cause an unacceptable level of transaction delay, since other related Tables may also need to be locked until the transaction is complete. Obviously the granularity of data locking for databases needs to be much finer where required, at least down to the individual record within a file, or smaller. The cost of this finer granularity is increased system overhead, as the system has to identify smaller sets of data and hold each set in memory; indeed, if multiple records within a file require locking, it may use fewer system resources simply to lock the whole file anyway.

If a system has sufficient hardware resources available, though, finer locking granularity is desirable, as more of the whole database remains available for further transactions to continue unhindered.

There are two aspects to the locking process, which depend on whether a concurrent transaction intends to write to the same data presently being accessed by the current transaction, or only to read from it. This decides whether, and how, the CURRENT transaction asks the system to lock the data it is accessing, to prevent potential interference from the CONCURRENT transaction. If the CONCURRENT transaction makes a READ-only request, the data is not locked from it, as data accuracy and integrity are only threatened when the CONCURRENT transaction could potentially alter the data.
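As a minimal sketch of how this read/write distinction looks in practice (assuming a MySQL/InnoDB table called Accounts, invented here purely for illustration), a transaction can take an exclusive lock when it intends to write, or a shared lock when it only needs a stable value to read:

    -- write intent: lock the row exclusively so no concurrent transaction can change it
    START TRANSACTION;
    SELECT Balance FROM Accounts WHERE AccountID = 42 FOR UPDATE;
    UPDATE Accounts SET Balance = Balance - 100 WHERE AccountID = 42;
    COMMIT;

    -- read intent: a shared lock lets other readers proceed, but writers must wait
    START TRANSACTION;
    SELECT Balance FROM Accounts WHERE AccountID = 42 LOCK IN SHARE MODE;
    COMMIT;

Note that the unit being locked here is a single row rather than a whole table or file: the finer the unit, the more of the database remains available to other transactions, at the cost of the extra bookkeeping described above.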

Data Consistency

"IMS protects against data inconsistency…by recording all data base

changes made by a program in a dynamic log. If the

program reaches a synchpoint, its dynamic log entries are

discarded, thereby committing its data changes. If the

program ends abnormally…the system… uses the dynamic

log to back out all data base changes…since its most recent synchpoint.

WC Mcgee (1981) Data Base Technology, PDF https://domino.watson.ibm.com/tchjr/journalindex.nsf/c469af92ea9eceac85256bd50048567c/18c2b2dadee8a44985256bfa0067f4d8?OpenDocument p1,p505 IBM I. RES. DEVELOP. VOL. 25 0 NO. SEFTEMBER 1981

Data consistency is achieved by generating "logs" that keep a record of every transaction event that takes place, used in conjunction with periodic full backups of the Database. In the event of Database failure or corruption, the Database can be rebuilt up to the point of failure from the last full backup by replaying the transaction logs recorded since that backup. To guard against drive failure or corruption, the log files should be stored on a different drive from the current Database instance itself, and preferably the log drives should be "mirrored" so that an automatic duplicate of the logs is always created – as they are the only most recent versions in existence – for the same single-drive-failure reasons mentioned above.
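The same principle is visible at the SQL level in any transactional DBMS. In this minimal sketch (the table and column names are hypothetical), changes made inside a transaction are recorded in the DBMS's log and only become permanent at COMMIT; ROLLBACK uses the logged information to back the changes out, much like backing out to the most recent "synchpoint" in the McGee quote:

    -- an attempted change that is abandoned: the log is used to undo it
    START TRANSACTION;
    UPDATE Students SET Campus = 'Springfield' WHERE StudentID = 1;
    ROLLBACK;    -- the database returns to its state before the transaction began

    -- the same change carried through to completion
    START TRANSACTION;
    UPDATE Students SET Campus = 'Springfield' WHERE StudentID = 1;
    COMMIT;      -- the change is now permanent and recorded in the log for recovery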

3.     Explain what user grants and revokes are.

In Data Control Language (DCL), which is a subset of Structured Query Language (SQL), a "Grant" is the privilege given to a user or a "role" (e.g. an Administrator) to carry out specific tasks, usually at a higher security level than their "normal" user level would inherently allow, so that the user or role can make fundamental, permanent changes to a Database.

A "Revoke" is the action of reversing previously allowed Grants by removing the permission of the user or role to perform that action any longer.

In MySQL, these commands are usually of the form:

"GRANT privileges ON db_name.* TO username IDENTIFIED BY ´password´"

Larry Ullman (2003) MySQL, Berkley, CA: Peachpit Press

It is important to note the inclusion of the "IDENTIFIED BY 'password'" clause in this command for security reasons, and that the GRANT command also creates a new user account at the same time. If the GRANT is to be issued to a networked user, then the user must be named together with the networked identity of their machine, usually by its IP address, in the form "username@192.168.1.x". Some specific MySQL commands for which an Administrator can grant or revoke a user's rights are:

"SELECT – Read rows from tables

INSERT – Add new rows of data to tables

UPDATE – Alter existing data in tables

DELETE – Remove existing data in tables

CREATE – create new tables or databases

DROP – Delete existing tables or databases"

Larry Ullman (2003) MySQL, Berkley, CA : Peachpit Press
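As a minimal sketch (the account name, host address, database name, and password below are invented for illustration, and the IDENTIFIED BY form follows the older MySQL syntax quoted from Ullman above), granting and later revoking some of these rights might look like this:

    -- create a networked account and give it read and write rights on one database
    GRANT SELECT, INSERT, UPDATE ON college.* TO 'homer'@'192.168.1.10' IDENTIFIED BY 'secretpass';

    -- later, withdraw the rights that change data, leaving read-only access
    REVOKE INSERT, UPDATE ON college.* FROM 'homer'@'192.168.1.10';

Note that REVOKE removes the named privileges but does not delete the account itself.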

4.     Define data integrity and explain how it may be implemented.

Data Integrity

"Incomplete and inaccurate data will destroy people´s confidence in the database….when a firm faces bankruptcy because destroyed data cannot be reconstructed, there is strong motivation to install integrity control mechanisms, both system control and administrative control"

Gordon C. Everest (1986) Database Management, Singapore: McGraw-Hill Book Co.

A fundamental reason to create a Database is to "model" a real-world "entity", whether an object, person, or idea, by assigning one or more descriptive identifiers to the thing being modelled – StudentName, StudentID, and so on. These identifiers have to conform to the formats available for "containing" the data within the limitations of the Database Management System software, which operates on top of the underlying computer system; the formats are finite and of the forms the computer can "understand" – strings, integers, characters, etc. They are assigned in the Table design and creation phase, so that ONLY that particular legal form can apply to the specific datum that makes up part of the "description" of the entity. For example, if an integer is assigned as the available form for the StudentID field, then ONLY an integer value should exist in present and future occurrences of that column. If any other type can be entered in this field, either deliberately or erroneously, then the integrity of the data in the Table is lost, and the ability of the Database to deliver what was originally intended is in question.

Data Integrity Definition: "the continued, ensured existence (of data) in the same format that was originally defined for it and no other".

Integrity of data can be ensured by various types of cross-checking, starting at the Table creation phase, as the datum type is defined at that point for each particular column. Checks can also occur across other data fields in the same or related Tables where a relationship exists between them, so that the expected datum type in a particular field should correspond. These checks come in four categories: Domain (column) integrity, Entity (row) integrity, Referential (Table/cell to Table/cell) integrity, and User-defined (business model) integrity.

Domain (Column) integrity has been explained above in general terms, but a graphical representation may be helpful. In the tables below, each column can only hold the data type set for it – integer, variable-length character – and the example also demonstrates how row and referential integrity between multiple Tables and cells can work. A relationship between the two tables is made through a key – a Primary Key – that picks out a single commonality between them, giving a "uniqueness" trait that can exist for only one individual student, Homer Simpson, in both tables: his individual StudentID.

Table1 Students

StudentID (int) | StudentName (varchar < 60) | CourseID (int) | Campus (varchar < 30)
001             | Homer Simpson              |                |

Table2 Courses

CourseID (int) | CourseName (varchar < 30) | Course Start Date (char) | StudentID (int)
54             | Nuclear Physics           |                          | 001

Row (Entity) integrity is commonly checked using the above method of a Primary Key or a Unique field that links two separate Tables; this checks that the format of the table is correct horizontally, as these keys may sit in different columns in the two tables, as in the example above. Two data fields can be chosen to reflect the "uniqueness" of the student – the StudentID and StudentName combination would be ideal as the Primary Key, as even two students sharing the same name would have unique StudentID numbers, so an individual student can be uniquely identified even when relating across tables.

Referential integrity checks can be made between data cells that are linked in some way and so share a common reference – perhaps a calculation on two fields and a resulting total, as in an Excel spreadsheet – where the computed value should only lie within a certain limit, so that the checks for that result stay within the designed acceptable range. Similarly, the input to a data field may be restricted to one of two letters – M or F in a gender-typing scenario – so any other character is unacceptable and would indicate a data error.

User-defined (Business Model) integrity checks can be specified depending on the attribute required within a cell value, as defined by the user – say, an email address that must contain a "@" character, otherwise an error condition is reported.

From combinations of the four integrity check types possible within Tables, it can be seen that there are many possibilities for checking the overall integrity of data entered into and stored in Database Tables; a worked table definition covering the four categories is sketched below.
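As a minimal sketch of how these checks can be declared at table-creation time (the column names follow the Students/Courses example above; note that the CHECK clauses are an assumption here, since some older MySQL versions parse but do not enforce them), the four categories might be expressed like this:

    CREATE TABLE Students (
        StudentID   INT         NOT NULL,    -- domain integrity: whole numbers only
        StudentName VARCHAR(60) NOT NULL,    -- domain integrity: text up to 60 characters
        Gender      CHAR(1),
        Email       VARCHAR(100),
        Campus      VARCHAR(30),
        PRIMARY KEY (StudentID),             -- entity (row) integrity: each row uniquely identified
        CHECK (Gender IN ('M', 'F')),        -- only the accepted characters, as in the M/F example
        CHECK (Email LIKE '%@%')             -- user-defined (business model) rule: must contain "@"
    );

    CREATE TABLE Courses (
        CourseID   INT NOT NULL,
        CourseName VARCHAR(30),
        StudentID  INT,
        PRIMARY KEY (CourseID),
        FOREIGN KEY (StudentID) REFERENCES Students (StudentID)   -- referential integrity across tables
    );

Any INSERT or UPDATE that violates one of these declared rules is rejected by the DBMS, so the data that reaches the Tables keeps the format originally defined for it.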

5.     Explain how applications can be developed using proprietary database tools.

Application Development

"Applications developers translate software requirements into concise and robust programming code. Most will specialise in a specific development environment, e.g. computer games or e-commerce, and will have in-depth knowledge of a few relevant computer languages. The role involves writing specifications and designing, building, testing, implementing and sometimes supporting applications using computer languages and development tools."

Gordon C. Everest (1986) Database Management, Singapore: McGraw-Hill Book Co.

The Database Management System provides the fundamental components with which to develop specific applications tailored to user needs – users who work within their current Graphical User Interface and probably lack programming-language knowledge. The Application Developer combines the required elements from low-level and high-level languages to create an appropriate application that fulfils those needs while shielding the user from the underlying complexity involved.

This involves encapsulating what could be many cryptic lines of low-level code or commands required to access and manipulate the database within something as easy as a mouse click on a menu, or similar, in the user's modern graphical environment. The main difference for the developer of this higher-level application is in the way the database is viewed as a complete structure in itself, so that complex data-manipulation commands can be constructed from the lower-level language syntax required for incorporation into the new application.

Original proprietary database tools are combinations of programs and Data Manipulation Language commands that were (usually) developed in-house by large corporations such as IBM, who invested in, researched, and developed Databases originally, and so they remain corporate property. This means that development of applications by third parties, where access and usage rights to these tools and programs are required, is usually done under some form of licence agreement.

It is possible to download database developer's kits from many DBMS vendors that incorporate package-deployment licensing.

As with any program design, user input is required to establish which tasks the application needs to perform and which are merely desirable. Input from a Database Administrator would be invaluable in the application's development, as he or she should have the most knowledge of the system's base programs and tools, and of what various users would require from the application for querying the database.

Bibliography

  • Gordon C. Everest (1986) Database Management, Singapore: McGraw-Hill Book Co.

  • Michael J. Hernandez (1997) Database Design for Mere Mortals, Reading, MA: Addison-Wesley

  • Larry Ullman (2003) MySQL, Berkley, CA: Peachpit Press

  • Chris Fehily (2002) SQL, Berkley, CA: Peachpit Press

  • Miriam Liskin (1997) Microsoft Access 97 for Windows – Superguide, Emeryville, CA: Macmillan

  • Andrew S. Tanenbaum (2003) Computer Networks 4th ED, Upper Saddle River, NJ: Pearson Education

  • Ronald R. Plew, Ryan K. Stephens (2000) Teach Yourself SQL 2nd ED, Sams Publishing
