Business Intelligence: The Data Analysis Phase

03 October

I think every IT Professional has a specialist subject which they feel most passionate about. Mine is data. If you’re building Business Intelligence or reporting systems, having good quality and well understood data is absolutely paramount. Unfortunately, I’ve lost count of the times when I’ve seen system developers jump straight for their tools - sometimes pushed by business managers and project managers who just want their systems delivered. They focus on building automated data feeds and reports, without truly understanding what the data is, and what it means to the business. This results in a lot of money being spent on something that doesn’t actually work - and which generates misleading reports that destroy everyone’s credibility.

Sometimes you need to go back to basics - and the most basic requirement for any Data Warehouse or BI system is data. Skimping on activities in the data analysis phase greatly increases project risk.

Lack of understanding about data means that analysts and developers may end up:
  • Using data fields and values because they're convenient,
  • Basing systems on their own data assumptions, without doing the "deep dive" to understand what that data really is. And,
  • Glossing over obvious data quality problems on the basis that they're “something to be dealt with later”.
These issues often occur in the rush of getting a system out the door to demonstrate to the execs. Even though data might be raised as an issue at the time, there’s generally an expectation that those fancy new software tools will “auto-magically” sort it out. Sadly, data problems usually come back to bite - and are more expensive to deal with once the system’s already been built.

The tips below represent my recommended approach to the project’s Data Analysis Phase. Getting this right doesn’t need investment in fancy tools or specialist developers … and most of the people that you need for this exercise are probably already working for you. The techniques listed below are pretty basic, and should already be second nature for practically any experienced data analyst … but documenting them will hopefully benefit anyone who’s new, or anyone looking to plan the Data Analysis activities.

1. Project Plan and Resources for the Data Analysis Phase

First and foremost, the Data Analysis phase needs to be planned for. Time and resources need to be dedicated to this activity in the project plan. This activity is separate from the user interview phase of the requirements gathering - and will need its own dedicated resources. It will need:
  1. Resources assigned (Data Analysts, Business Users, and IT Support)
  2. A reasonable period of time for the analysis to be completed.
Dedicated data analysts can be recruited specifically for the project - but can also be sourced internally if you have business analysts or developers with the right mix of technical and business skills. In some cases the data analysis might use dedicated data profiling tools (e.g. from Oracle or other providers). However, I’ve seen just as much accomplished by business resources with only basic SQL, but who knew to ask the right questions.

The points that follow should give you a better idea about specific skills and activities required. The ultimate aim is to give your Data Architect and developers something that they can work with. This will allow them to finalize the data model, and to cater for any special scenarios that might need to be handled within your ETL. Information from your data analysis will likely also benefit your Test Manager when defining scenarios and acceptance criteria for use by the test team.

2. Evaluate Organisational Data Complexity

Organisational data complexity will be one of the first indicators of what you’re dealing with - and how much time you might need for the fuller analysis. This is determined by the number of systems involved, and the variations of those systems.

For example:
  • Are you bringing data across from a single ERP?
  • Do you have multiple instances of the same ERP?
    (e.g. used by different regions or divisions), or do you need to integrate data from multiple ERP products?
  • Besides the ERP - are there cross-overs with other systems requiring data to be matched and merged? If so, how complex are these individual systems, and do you have any gut feel about quality and replication of data between these systems?
Greater complexity means that you’ll likely need more time and more resources to complete the analysis. Either way, use this information to put together an analysis plan. Consider options such as breaking the analysis effort into smaller chunks - and prioritising the systems and data that you plan to focus on.

Back in NZ, data complexity for the projects and systems that I worked on was relatively simple. Although our bigger systems had a lot of data feeds (sometimes high-volume, such as telecommunications call details or SMS records), each feed was limited to a specific subject area. This reduced our need to merge records from multiple sources, or apply overly complex logic. Our organisations tended to be smaller, with greater consistency. By contrast, the organisations that I’ve worked with since coming to Australia have been global. They’ve had different databases and systems in different countries, different languages, and much greater variation (and replication) due to companies having been built up through acquisitions.

3. Evaluate Data Structures and Data Integrity

Once you have a basic idea of the data complexity, and know what you’ve gotten yourself in for, the next step is to start gathering information about source systems and their data structures. This gives you a high level view to start figuring out what data is where.

Most systems should ideally have Entity-Relationship Diagrams (ERD) available, or maybe a Star Schema if there’s an existing reporting system.
  • These will typically be held by the internal IT support teams. However:
  • In cases where the system has been purchased from an external vendor (and is possibly still managed by that vendor), you'll need to liaise with that vendor.
Okay - you’d think that ERDs would be a no-brainer and that they’d be readily available. I mean, how can a team maintain a system if they don’t actually know the structure of the data contained within it? Sadly, this lack of documentation is more common than you’d think - and getting the info together can take effort and a lot of persistence. Sometimes you’ll be lucky and encounter a team that’s really on the ball and who can give you exactly what you need. In other cases, digging might get you a copy of the original (not necessarily current) design documentation, which at least gives you a starting point for investigation. There are also situations where the documentation exists - but it’s contained within some vendor’s proprietary systems … meaning that you end up getting a bunch of exports needing to be pieced together.

If all else fails, try getting a list of the tables and data fields, highlighting the primary and foreign keys. If the support team or vendors can’t provide this to you, then it’s something that you can potentially generate for yourself … if you have access to the backend database. Common databases such as Oracle and SQL Server have hidden metadata or Data Dictionary tables which you can query to generate a list of tables and associated fields - including field types and short field descriptions (if recorded). Although this isn’t perfect, it gives you a starting point to begin piecing together your own ERDs. Additional information will also be revealed if you get access to existing SQL queries and reports - but be aware that this will be a time-consuming piece of detective work.
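As a minimal sketch, a query along these lines against the standard INFORMATION_SCHEMA views (SQL Server and most ANSI-compliant databases) will give you a basic table/column listing to start from - Oracle’s equivalent is ALL_TAB_COLUMNS:

    -- List every table and column in the database, with data types.
    -- (INFORMATION_SCHEMA is standard SQL / SQL Server; use ALL_TAB_COLUMNS on Oracle.)
    SELECT c.TABLE_SCHEMA,
           c.TABLE_NAME,
           c.COLUMN_NAME,
           c.DATA_TYPE,
           c.IS_NULLABLE
    FROM   INFORMATION_SCHEMA.COLUMNS c
    ORDER BY c.TABLE_SCHEMA, c.TABLE_NAME, c.ORDINAL_POSITION;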

Once you have an idea about the data structures, try a few basic queries (using SQL or whatever toolset you have available). Try verifying joins on the key fields - looking for mismatches (i.e. key values that occur in one table but not another), duplicate values, or other inconsistencies.
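For example, a couple of throwaway queries along these lines will surface orphaned keys and unexpected duplicates (the table and key names here are just made up for illustration):

    -- Orders whose customer key has no match in the customer table (orphans).
    SELECT o.customer_id, COUNT(*) AS orphan_rows
    FROM   orders o
    LEFT JOIN customers c ON c.customer_id = o.customer_id
    WHERE  c.customer_id IS NULL
    GROUP BY o.customer_id;

    -- Customer keys that appear more than once (unexpected duplicates).
    SELECT customer_id, COUNT(*) AS occurrences
    FROM   customers
    GROUP BY customer_id
    HAVING COUNT(*) > 1;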

Data inconsistency is a greater problem when attempting to join data between systems. For example, you might have two systems containing employee data - but find that the recorded contact details and position titles are inconsistent between the two. This begs the question: which system do you trust?
  • Data integration between systems (e.g. automated data feeds and data reconciliations) will reduce the scale of these issues, if implemented - but this integration hasn't been a priority for every organisation.
  • For instance: I recently worked with an Australian company which had their Australian employee data manually input and independently maintained in six different systems. This didn't include employee data held in operational systems maintained by their overseas offices. Employees who'd been marked inactive in one system frequently still showed as active in the other systems.
The easiest way to resolve these inconsistencies is through the identification of Master Data (i.e. trusted data sources) by subject area and data entity. For example - payroll data might be considered the trusted data source for permanent employees - whereas other systems (e.g. network security and user profiles) might be deemed the trusted source for contractor data. These systems can be identified through user interviews. In general you should choose the systems whose data is reviewed and updated most frequently … although you may still need to merge data from multiple systems to construct a complete master for some subject areas.

4. Data Profiling

Having identified your data sources, the next step is data profiling. This means getting a more detailed picture about the quality and completeness of data contained within the systems you’re reviewing.

For example:
  • Does it appear that all records are being captured - or does there appear to be missing data?
  • When did the data capture appear to start?
  • Are there gaps or periods of incomplete data?
  • Are there other observable data issues or sudden jumps in trends? These might typically coincide with historic process changes or data migrations, and could impact on the reliability of reports generated from this historic data. Getting an early idea of these issues (and recording them in the metadata) will help you answer later questions from report writers, managers and in training. Hopefully, it will also reduce the risk of this data being misinterpreted in future.
Getting a picture of this data can be difficult, since you don’t know what you don’t know. I usually begin this exercise with a series of basic queries, then home in on specific issues once I identify them.

My initial queries are typically:
  • Does this data look complete? i.e.
    Are all required fields captured in all records (or are there fields that are frequently skipped/incomplete)?
  • How many records are there overall?
  • Try counting record volumes by year (or month) - see the sketch below this list. Are these volumes reasonably consistent?
  • When did data-capture appear to start?
  • What volume changes have applied over time?
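The volume-by-year check is a one-liner - here’s a rough sketch assuming a hypothetical transactions table with a transaction_date column:

    -- Record volumes by year: gaps, late starts, or sudden jumps stand out quickly.
    -- (EXTRACT is ANSI SQL; use YEAR(transaction_date) on SQL Server.)
    SELECT EXTRACT(YEAR FROM transaction_date) AS txn_year,
           COUNT(*)                            AS record_count
    FROM   transactions
    GROUP BY EXTRACT(YEAR FROM transaction_date)
    ORDER BY txn_year;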
I then group the data into classification and value (i.e. non-classification) fields.

Classification Fields
Data categories (which might be used to populate Dimensions in a Star Schema or cube) are especially important to delve into, since these have a major impact on reporting capability. They’re highly visible as a means to group, summarise and filter data.
  • What categorisations are used?
  • Which tables are used to define these categorisations (reference data)?
  • Generate a frequency distribution
    (e.g. using a SQL SELECT COUNT(*) … GROUP BY - see the sketch after this list)
  • Based on this distribution, look to see how categorisations have been applied.
  • Review categorisations to look for potential duplicates, ambiguities, and how definitions might have changed over time.
  • Are there multiple levels of categorisation (i.e. sub-categories)? If so, you may need to repeat similar analysis for these.
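The frequency distribution mentioned above is simple to generate - a sketch using made-up table and column names:

    -- How often each category (and sub-category) has actually been used,
    -- including rows where no category was ever assigned.
    SELECT COALESCE(product_category, '<not set>')    AS category,
           COALESCE(product_subcategory, '<not set>') AS subcategory,
           COUNT(*)                                   AS record_count
    FROM   products
    GROUP BY product_category, product_subcategory
    ORDER BY record_count DESC;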
Don’t just consider the specific category headings that you see in the data. Sometimes it’s worth casting the net a little wider to see if there might be similar categorisations or groupings being used elsewhere.

For example:
  • If you're looking at product categories for an internally-managed purchasing or procurement system, consider adding support for a globally-recognised product classification such as the United Nations Standard Products and Services Code (UNSPSC) in addition to your own internally-developed classifications.
  • This is as simple as adding an extra column to your dimension and a heading-by-heading mapping between the internal and external classification (it isn't as accurate as an item-by-item mapping … but is much quicker).
Being able to classify and group your purchases or expenditure by UNSPSC (or another externally recognised classification) lets you compare your purchases with those of other similar organisations. Providing alternative classification methods within the dimension also saves future report writers and data analysts having to independently create their own conversions.

I used this approach several years ago when developing categorisations for NZ judicial offence codes. The aim of this was to compare NZ criminal offending and conviction statistics with those for Australia and the UK. This meant that statisticians needed to be able to generate the NZ figures using the Australian and UK classifications … essentially comparing “apples with apples”. However, doing this was tricky since NZ, the Australian Federal/State Governments, and the UK have all independently drafted their own legislation and offence definitions. Fortunately, each country reports its own crime statistics and, in doing so, provides a clear description of the Offence Type definitions that it uses for grouping and counting these offences. These definitions help analysts determine where the lines are drawn for offences that may involve elements of multiple offence types - e.g. where “aggravated robbery” might fit under the category headings: Violence, Robbery, Theft etc. Having these definitions allowed Offence Type mappings to be generated between the NZ, Australian and UK classifications - and then loaded onto the offence type dimension so that the statisticians could readily generate their reports using whichever scheme they wanted.

Value (non-classification) Fields
For value fields, look at minimums/maximums and frequency distributions. If there are many distinct values, then group them by ranges.
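A quick profile of a value field might look something like this (again, the table and column names are made up):

    -- Basic spread of a numeric field: range, average, and how often it's populated.
    SELECT COUNT(*)            AS total_rows,
           COUNT(order_amount) AS populated_rows,
           MIN(order_amount)   AS min_amount,
           MAX(order_amount)   AS max_amount,
           AVG(order_amount)   AS avg_amount
    FROM   orders;

    -- Bucket the values into $1000 bands to see the shape of the distribution.
    SELECT FLOOR(order_amount / 1000) * 1000 AS amount_band,
           COUNT(*)                          AS record_count
    FROM   orders
    WHERE  order_amount IS NOT NULL
    GROUP BY FLOOR(order_amount / 1000) * 1000
    ORDER BY amount_band;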

Consider Null Values, Defaults and Mandatory Fields
Analyse the incidence of null values. Also look to see if any of the fields have default values specified within the original source systems. The problem with default values is that they result in unrecorded data masquerading as real data. This forces us to ask: is this a genuinely recorded value, or is it there because the data-entry operator didn’t have anything to enter (or simply skipped the field)?
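A sketch of the kind of check I mean - counting how often a field is null, left at its default, or genuinely recorded (the field and default value shown are hypothetical):

    -- How often is the field null, left at the data-entry default, or genuinely recorded?
    SELECT COUNT(*)                                                   AS total_rows,
           SUM(CASE WHEN region_code IS NULL THEN 1 ELSE 0 END)       AS null_rows,
           SUM(CASE WHEN region_code = 'UNKNOWN' THEN 1 ELSE 0 END)   AS default_rows,
           SUM(CASE WHEN region_code IS NOT NULL
                     AND region_code <> 'UNKNOWN' THEN 1 ELSE 0 END)  AS recorded_rows
    FROM   customers;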

Including records with default values can significantly skew your reports with a large number of untrue values. Therefore you need to make a judgment call about the data validity. If there appears to be a large amount of invalid data:
  • Consider options to identify and exclude this invalid data from analysis.
  • Consider options to improve the quality of future data capture - such as changes to data-entry screens and rules in the source systems. These changes can be discussed and negotiated with the source system owners. In addition to defaults, also consider the effect of mandatory fields. Although mandatory fields are supposed to ensure that data is captured, these can potentially also trigger bad behaviours where users enter meaningless "junk" values simply as a means to complete the transaction or progress onto the next screen.
Observations should ideally be recorded in your metadata notes.

Data Source Analysis
The next factor that I look at in my data profiling is the impact that data sources have on the data being collected. For example - if you’re integrating data from different data sources in different countries, the data sources may have different reference data (due to different languages or standards), different text descriptions, and recording formats. As such, it’s important to review the data to identify patterns (and conversions) which might need to be factored into the data integration or standardisation rules.

Section 3 above briefly mentioned the concept of Master Data - i.e. identifying the “trusted source” for specific subject areas or data entities. Beyond that, try linking equivalent entities from the various sources together (as well as other related tables). Consider:
  • Are the data sources compatible?
  • Are there mismatches (data appearing in one system but not others, or appearing in both systems but recorded inconsistently)?
  • Compare equivalent categorisation/reference data fields applied across these multiple systems. Are they compatible/consistent?
If there are multiple instances of the same system (i.e. local instances of the same database being hosted in multiple regions), look for data crossovers such as duplicate records for employees who’ve moved between regions, or duplicates of the same customer data. Duplicates can also arise where multiple systems existed side-by-side prior to an acquisition.
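One way to flush these out is to compare the instances on a natural key such as an email address or tax number - a rough sketch (the instance schemas and columns here are made up):

    -- Employees who appear in more than one regional instance of the HR system.
    SELECT email_address,
           COUNT(DISTINCT source_system) AS instance_count
    FROM  (SELECT email_address, 'HR_APAC' AS source_system FROM hr_apac.employees
           UNION ALL
           SELECT email_address, 'HR_EMEA' AS source_system FROM hr_emea.employees) combined
    GROUP BY email_address
    HAVING COUNT(DISTINCT source_system) > 1;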

5. Analysis for Dimension Tables

Developing the dimension tables is the responsibility of the Data or Solution Architect. Nonetheless, as a data or business analyst, I find it helpful to come up with my own conceptual framework. This helps me get my head around the data structures - and trial different approaches. This gives me the opportunity to test how my conceptual view fits with the reality of the business data that’s been captured.

Dimensions are often linked to the classification fields identified in section 4 above.

Consider Data Hierarchies and Groupings
Identify hierarchies in any categorisations that you’ve applied. For example: a group of products might be classified as vehicles - but under that, there might be more specific subcategories, such as cars, light utility vehicles, heavy trucks, loaders etc.
  • Try to identify the subcategories under each main category.
  • Do some subcategories need further subcategory levels defined?
  • Try to map out the overall grouping hierarchy
  • Identify if there's someone managing and maintaining these groupings. How are they kept up to date?
A classic hierarchical dimension is the customer dimension. A customer can be an individual or single-proprietor business, or it can be a large, complex corporate or government agency with many branches and subsidiaries. These agencies may be grouped into multi-level hierarchical structures which are periodically split or merged as the organisation restructures or acquires other organisations. Ideally, this customer hierarchy (along with addresses and contact information) should be maintained centrally to avoid duplication of effort and data inconsistencies.
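In the reporting model, that hierarchy is often flattened into level columns on the customer dimension itself, so reports can group or filter at any level without recursive joins. A minimal sketch (the table and columns are made up for illustration):

    CREATE TABLE dim_customer (
        customer_key   INTEGER      NOT NULL,  -- surrogate key
        customer_id    VARCHAR(20)  NOT NULL,  -- business (natural) key
        customer_name  VARCHAR(200) NOT NULL,
        parent_group   VARCHAR(200),           -- level 1: corporate group
        division_name  VARCHAR(200),           -- level 2: division / subsidiary
        branch_name    VARCHAR(200),           -- level 3: branch
        customer_type  VARCHAR(50),
        PRIMARY KEY (customer_key)
    );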

Linked Attributes
The dimension tables also need to contain attributes associated with that entity. For example, the customer dimension will likely need the customer address (potentially separate physical, mailing, and billing addresses), contacts and phone numbers. There might also be customer type classifications and territories recorded against each customer to help with filtering and grouping.

Consider Dimension Types
Take a close look at the Dimension and consider how often the data changes. For example - if you have a customer dimension then it’s likely that the customer address and contact information will change from time to time. It’s also likely that the hierarchical structure of corporate customers will change. If you have information within dimension records which is likely to change over time, consider the frequency of changes being made, and the implications for your future reporting.

In data modelling speak, this introduces the concept of the Slowly Changing Dimension (or SCD):
  • For the customer contact example: Do you just want to know the current customer contact? This may suffice for most reporting, since the current contact is the person that your organisation is likely to be dealing with. Hence, if the contact changes (noting that changes should be detected in data loads as data is refreshed from source systems), you might simply want to overwrite the old contact details with details for the new contact. This is quick and simple to implement - and will be sufficient for most future reporting. This is known as a Slowly Changing Dimension Type 1 (or SCD1). The only downside is that you can't easily go back to see who the customer contact was at some earlier point in time - for example, who the contact was at the time a particular order was placed (unless the contact is also recorded on the order record).
  • If you want to retain the history of changes, or keep a point-in-time view of the data, then consider using a Slowly Changing Dimension Type 2 (SCD2). Rather than overwriting the contact value within the customer record, you create a new version of the record. This contains all details as per the previous version, except that it holds the new contact. The old version is marked to indicate that it's no longer the latest. This is usually done with an is_current flag field on the record, and/or start/end date fields showing the period for which each version was valid (see the sketch below this list).
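Here’s a minimal sketch of what the SCD2 pattern looks like in practice (the table and column names are hypothetical):

    -- SCD2-style customer dimension: each change creates a new row version.
    CREATE TABLE dim_customer_scd2 (
        customer_key  INTEGER      NOT NULL,  -- surrogate key, one per version
        customer_id   VARCHAR(20)  NOT NULL,  -- business key, shared by all versions
        customer_name VARCHAR(200) NOT NULL,
        contact_name  VARCHAR(200),
        valid_from    DATE         NOT NULL,  -- when this version became effective
        valid_to      DATE,                   -- NULL while this is the current version
        is_current    CHAR(1)      NOT NULL,  -- 'Y' for the latest version
        PRIMARY KEY (customer_key)
    );

    -- Point-in-time query: who was the customer contact when each order was placed?
    SELECT o.order_id, d.contact_name
    FROM   orders o
    JOIN   dim_customer_scd2 d
      ON   d.customer_id = o.customer_id
     AND   o.order_date >= d.valid_from
     AND  (d.valid_to IS NULL OR o.order_date < d.valid_to);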
SCD2 dimensions are important for financial reporting since they let you reproduce financial figures as they looked at a specific point in time (e.g. financial year end), irrespective of any subsequent updates or backdated transactions. For example, one of my first jobs was reporting monthly HR headcount and expenditure information for a bank. Unfortunately, the figures reported for previous months always changed each time I ran the report. This was a constant problem - generally occurring because someone somewhere had put in backdated sick leave or annual leave (or had corrected an incorrectly entered figure). If I used the real figures generated at runtime, then management always asked why the July figure in this month’s report was slightly different from the one in last month’s report. The only way around this was for me to keep a separate manually-maintained worksheet containing the “as reported at month-end” figures. ERP and financial reporting systems achieve this by having their own specific data fields for official month-end values (these usually cover most, albeit not all, reporting requirements).

SCD2s are also important for hierarchies - since you may want to generate reports grouping a customer’s year-end spend into branches and subsidiaries as they existed at that time, irrespective of subsequent structural changes that have happened since year-end. See this link for a more detailed explanation of SCDs.

Other factors you should look at are Ragged Hierarchies (where the number of levels for categorisations and subcategorisation isn’t always consistent - for example, addresses in some countries have States, whereas other smaller countries don’t), and other types such as Conformed Dimensions, Junk Dimensions and role-playing dimensions (see this link). All of these will have an impact on the reporting data model to be developed in section 7 below.

6. Identify Measures (and Fact Tables)

These are the value fields needed for generating the totals, subtotals and calculations requested by business users for their reports. Identifying the measures also helps you identify the Fact tables and the data granularity, since these are all closely linked. Measures are also central to the design of data cubes, if cubes are to be used as the basis for your reporting.

Methods to identify the measures include:
  • Review existing reports and KPIs.
  • Interview users or business management to find out the indicators or numbers they're most interested in seeing.
  • Analyse existing reports and spreadsheets to determine how (and what) data is extracted, and actual logic for existing calculations. This detailed analysis needs some level of technical expertise for reading SQL and reverse-engineering reports in whatever reporting tool has been used.
  • Interview existing business managers and analysts to document and confirm the business logic that should be applied (compare this with the actual logic determined in the point above).
  • Agree and document the logic that will be used going forward in Fact tables and reports to be developed.
Identifying measures can work out to be the most complex part of your data analysis. Even if the measures are clearly defined in the organisation’s documentation, it may turn out that different analysts and report writers have interpreted them slightly differently, or have gone off to do their own thing. Unravelling this, and bringing everyone back to agreement can be a major exercise. This also has implications for your testing. If your new reports are using the agreed (and slightly different) new formulae and extract logic, then you can’t (easily) verify the new reports by comparing them with figures shown on the old reports.

7. Consider Summarised Data and Data Definitions

Besides the measures identified above, you might also want other tables containing data summarised into pre-defined groupings. Although it should be possible for users to generate their own summaries, having these pre-built will save time and make commonly-used reports much quicker to run. It also saves system resources, since the summaries don’t need to be recalculated each time such a report is run.
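As a rough sketch (with made-up names), such a summary might be refreshed once during the nightly load rather than recalculated by every report:

    -- Refresh a pre-aggregated monthly sales summary as part of the ETL run.
    -- (EXTRACT is ANSI SQL; use YEAR()/MONTH() on SQL Server.)
    INSERT INTO sales_monthly_summary (sales_month, product_category, total_quantity, total_amount)
    SELECT EXTRACT(YEAR FROM s.sale_date) * 100 + EXTRACT(MONTH FROM s.sale_date) AS sales_month,
           p.product_category,
           SUM(s.quantity)    AS total_quantity,
           SUM(s.line_amount) AS total_amount
    FROM   fact_sales s
    JOIN   dim_product p ON p.product_key = s.product_key
    GROUP BY EXTRACT(YEAR FROM s.sale_date) * 100 + EXTRACT(MONTH FROM s.sale_date),
             p.product_category;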

Additionally, bring together the data definitions based on your analysis so far, and start getting them into a format which makes them accessible to users. This doesn’t necessarily require a complex metadata management system. A simple set of notes in a Microsoft Word document or a shared wiki can be a great start. If you have data being shared between organisations, share and discuss these definitions to ensure that you have a common understanding of what they mean.

8. Prototype Your Dimensional Models and Transformations

As per comments in section 5 above, this task belongs to the Solution or Data Architect … although small prototypes built by technically-capable Business Analysts can be helpful during the Data Analysis phase to trial and verify formulae or assumptions. This involves prototyping the proposed target data structures for Fact and Dimension tables (along with proposed transformation logic and calculations).
  1. Where possible, try to pre-calculate your measures (e.g. line-item price = quantity * product price) during the data load processing. Store these as simple data values within your Fact table records - this saves having to perform calculations at report runtime, makes reports quicker to run, and reduces the complexity (and replication of logic) within reports. Ideally, the only calculation logic that your reports should need to perform is group summaries such as totals and averages (see the sketch after this list).
  2. Also remember that fact tables don’t need to be fully normalised. Therefore, if it makes sense to replicate dimensional attributes within a fact table (e.g. categorisations for filtering and grouping) then do so. This also reduces the work and data linkages needing to be performed at report runtime and (again) makes your reports much quicker to run and develop.
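A sketch covering both points (all names here are hypothetical): the measure is calculated once during the load, and a commonly-used dimension attribute is copied onto the fact row for easy filtering:

    -- Fact load: pre-calculate the line amount and denormalise the product category.
    INSERT INTO fact_order_line (order_id, order_line_no, customer_key, product_key,
                                 product_category, quantity, unit_price, line_amount)
    SELECT ol.order_id,
           ol.order_line_no,
           dc.customer_key,
           dp.product_key,
           dp.product_category,                        -- denormalised for filtering/grouping
           ol.quantity,
           ol.unit_price,
           ol.quantity * ol.unit_price AS line_amount  -- pre-calculated measure
    FROM   stg_order_lines ol
    JOIN   dim_customer dc ON dc.customer_id = ol.customer_id
    JOIN   dim_product  dp ON dp.product_id  = ol.product_id;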
The key thing is that you’re building table structures which are optimised for simple queries and reporting - you’re not trying to minimise the data storage.
  • Start by drawing simple Star Schema models to confirm your target Fact and Dimension tables and relationships. Then,
  • Start prototyping the table structures, and trialling them with sample data. Where possible, use sample data taken from your source systems.
  • Try generating the calculated fields and categorisation mappings using logic which you've identified so far. Finally,
  • Try prototyping simple queries and reports which your business users have indicated that they want.
Having followed this process, you can then try answering the questions below:
  • Can your proposed structures generate the type of result needed?
  • Do the results "look" to be right based on the sample data you've used?
  • Are there data quality issues or refinements needed for the data transformation logic and calculations you've identified?
  • How often will the data need to be refreshed and updated from the source system?
  • What's the best means of getting data from the source system into your data warehouse?
I draw up these models and continually refine my prototypes during the analysis phase as I gain experience with the data and reporting requirements. These help reinforce concepts and become a powerful communication tool. They’re useful for communicating with developers and architects - but even small pivot tables in Excel or data cubes in SSAS (real and mock-ups) can help when talking with the business to verify categorisations and groupings or performance measures. This is a lot quicker than waiting for the final reports. It also presents something tangible to help the business visualize the data - rather than hypothetical BI-speak that the business users don’t understand. Having a business analyst with the technical skills to do this will be a great asset to your project.

Note: Although I’ve referred to Star Schemas and Dimensional Modelling, the outcome of your discussion with users (and prototype review) might be that your users really don’t want a Dimensional Model. Don’t try forcing them to use approaches that they really don’t want, or have difficulty using. I’ve encountered this several times - most notably with business analysts, statisticians, or report writers using the SAS suite. In these cases they often just wanted a set of simple “flat file” denormalised datamarts optimised for different reporting purposes. You have to remember to give customers what they want - no matter how much it goes against your training or instincts. Of course, I still keep the Star Schemas in reserve because eventually someone will want data cubes or derived reports.

9. Interview Business Users and Stakeholders about Data Issues and Observations in Your Prototypes

Now that your prototypes are built, review the data and sample reports with your business users (and the original data-entry staff) to identify issues and observations. In some cases, this may be the first time that anyone has looked at the data in detail. Doing this may reveal inconsistencies in data-entry practices, classifications or business procedures. Or, it may reveal issues with your data extracts, ETL process, or data assumptions.

Outcomes of this will be:
  • Improved data extract, ETL routines, and reporting data structures.
  • Data clean-up activities to improve (and fix) the quality of existing data in source systems.
  • Improved data quality checks.
  • Clarification of business process or data-entry instructions (improving the quality of data being entered into source systems).
  • Additional detail in metadata notes, describing known (often historic) data issues, process descriptions, and clarified data definitions.
Truth is that the existing source data is probably very far from perfect - but it’s important to at least know what you’re dealing with.
As part of your BI deployment, you may need to change practices for data capture in the source systems (e.g. making some fields mandatory, changing default values, refining data categorisations etc), develop data monitoring reports and KPIs, and incorporate data usage or definitions in your training and metadata notes.

10. Data Reconciliation and Quality Measures

Having been responsible for maintaining systems and investigating data issues, I place huge importance on having data reconciliations and quality checks built into your automated data loads. There’s nothing worse than
  • Finding out some of your data is corrupted because there was a problem with the data extract you received from a critical source system … one day six weeks ago. Or,
  • Uncovering data extract or loading issues that have been occurring for the past three weeks, which no-one has picked up till now.
Fixing these can be REALLY difficult. It involves trying to identify affected records, and reloading them either by retriggering them from the source system, or generating a special one-off extract. In cases where point-in-time histories are important (e.g. for financial views) - you may find yourself needing to roll the data warehouse back to some earlier point in time, then re-run all the subsequent data loads with corrected data. Whichever approach you follow, it’s likely to be very time-consuming, and creates added pressure for your internal resources and your users.

Therefore, you need to be proactive in ensuring the accuracy and completeness of your data. Waiting for users to report errors just isn’t good enough. Your BI Support team needs measures in place to detect and correct errors and inconsistencies before users ever notice. These checks should be run as part of your ETL, and should generate alerts or messages for your support team in the event that anything needs to be followed up. This won’t entirely eliminate the problem (since some problems may stem from business processes, data-entry, or other source system issues outside your control), but it will reduce your risk.

Building these checks is the responsibility of the original BI Project (or data feed team) since retrofitting them onto existing systems can be time consuming and expensive. However, if you’re in the unfortunate position of having inherited an existing system without these checks, you can still add them - although you may need to do this little by little due to the time, expense, and analysis involved.

Checks can include:
  • Source system record counts and checksums provided with each data feed (compare these with equivalent record counts and checksums on initial import into the staging database, and again with datamarts after the ETL has been run) - see the sketch after this list.
  • Automated checks for orphan records or record mismatches.
  • Automated checks for invalid values (e.g. out of range).
  • Automated checks for unexpected duplicate values.
  • Automated checks for new/unexpected reference data (master data) values.
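A minimal sketch of the first of these checks - comparing the control totals supplied with a feed against what actually landed in staging (the table names here are made up):

    -- Compare control totals supplied with the feed against what was actually loaded.
    SELECT ctl.feed_date,
           ctl.expected_row_count,
           ldr.loaded_row_count,
           ctl.expected_amount_total,
           ldr.loaded_amount_total,
           CASE WHEN ctl.expected_row_count    = ldr.loaded_row_count
                 AND ctl.expected_amount_total = ldr.loaded_amount_total
                THEN 'OK' ELSE 'ALERT' END AS reconciliation_status
    FROM   feed_control_totals ctl
    JOIN  (SELECT feed_date,
                  COUNT(*)         AS loaded_row_count,
                  SUM(line_amount) AS loaded_amount_total
           FROM   stg_order_lines
           GROUP BY feed_date) ldr
      ON  ldr.feed_date = ctl.feed_date;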
You don’t necessarily need a full suite of tests - as these might end up slowing your ETL down to an unacceptable level - but a small set of appropriately targeted tests will save headaches for your support team and users downstream. Aside from picking up issues with data feeds, you’re also likely to pick up problems missed by the original source system, including data-entry issues and data fixes gone awry (often created by automated data fixes or one-off updates applied as source systems are upgraded or extended).

Implementing basic checks will help improve your users’ confidence in your systems. Checksums and record counts can also help you deflect user criticisms that the data is “wrong”. In these cases you can prove that your data is consistent with the source system - getting users to liaise with the source system owners to ensure issues are fixed up there (data fixes needed for reporting should ideally be fixed first in the source system, then flow back into the data warehouse as part of the next data refresh).

Also consider how your automated processes detect issues and generate alerts to your support team. Automated emails are one way - but lineage tools and the development of a status dashboard are also worthwhile options. Besides your support team, this visibility of load status and data quality indicators will also be of interest to power users.

Some Useful References:

The following books are useful references for anyone wanting further information.
Aside from Amazon.com, electronic versions of these books are also available on Safari Books Online (which means that I have them all very conveniently accessible via my iPad).

The Data Warehouse Lifecycle Toolkit, 2nd Ed.
Author(s): Ralph Kimball; Margy Ross; Warren Thornthwaite; Joy Mundy; Bob Becker
ISBN-13: 9780470149775
Link to Amazon.com page or Read on Safari Books Online
This is the classic text for Data Warehousing. Aside from an introduction to Dimensional Modelling (Facts, Dimensions, Slowly Changing Dimensions etc), it contains invaluable advice and templates for the analysis, stakeholder management, and other tasks associated with any data warehousing/BI project. Ralph Kimball and Margy Ross’ other book (The Data Warehouse Toolkit) contains a more detailed description of the Dimensional Modelling approach, along with examples to cover specific scenarios they’ve encountered in their years of consulting. There’s also a range of other related books focusing on ETL and helping you leverage the specific capabilities of Microsoft SQL Server or Oracle BI products - although these are more technical and therefore less useful for Data Analysts.

Information Quality Applied: Best Practices for Improving Business Information, Processes, and Systems.
Author(s): Larry English
ISBN-13: 9780470134474
Link to Amazon.com page or Read on Safari Books Online
Detailed coverage of the importance of information quality, and the impact on the organisation if data can’t be trusted (albeit somewhat exaggerated IMHO, given that the company/project examples cited didn’t fail solely due to data or information quality issues). Introduces the concept of Total Information Quality Management (TIQM), based on the classic W. Edwards Deming-inspired Total Quality Management approach.

Data Strategy
Author(s): Sid Adelman; Larissa T. Moss; Majid Abai
ISBN-10: 0321240995
Link to Amazon.com page or Read on Safari Books Online
Very high-level (and, to be honest, I only read the first few chapters) - but a useful framework for helping to explain Data Strategy components to the uninitiated.

Master Data Management and Data Governance, 2nd Ed.
Author(s): Alex Berson; Larry Dubov
ISBN-13: 9780071744584
Link to Amazon.com page or Read on Safari Books Online
Much greater emphasis on Master Data Management side of Data Quality.

To Improve BPM, Go Back to the (Data) Source, Business Performance Management Jan 2011
Author(s): Linda Imonti, KPMG
Article Link