Using the QUERY function

The QUERY function is useful when you want to pull data from another spreadsheet. The QUERY function's SQL-like ability can extract specific data within a spreadsheet.

For a large amount of data, using the QUERY function is faster than filtering data manually. This is especially true when repeated filtering is required.

For example, you could generate a list of all customers who bought your company's products in a particular month using manual filtering. But if you also want to figure out customer growth month over month, you have to copy the filtered data to a new spreadsheet, filter the data for sales during the following month, and then copy those results for the analysis. With the QUERY function, you can get all the data for both months without a need to change your original dataset or copy results.

The QUERY function syntax is similar to IMPORTRANGE. You enter the sheet name and the range of data that you want to query, and then use the SQL SELECT command to select the specific columns. You can also add specific criteria after the SELECT statement by including a WHERE clause. But remember: all of the SQL code you use has to be placed between the quotes!
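For example, a minimal sketch (the sheet name, range, and columns are hypothetical):

=QUERY(Sales!A1:D100, "SELECT A, D WHERE B = 'March'", 1)

This returns columns A and D of the Sales sheet for rows where column B equals 'March'; the final 1 says the range includes one header row.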

Google Sheets runs the Google Visualization API Query Language across the data. Excel spreadsheets use a query wizard to guide you through the steps of connecting to a data source and selecting the tables. In either case, you can be confident that the imported data is verified and clean based on the criteria in the query.

Analysts can use SQL to pull a specific dataset into a spreadsheet. They can then use the QUERY function to create multiple tabs (views) of that dataset.

For example, one tab could contain all the sales data for a particular month and another tab could contain all the sales data from a specific region. This solution illustrates how SQL and spreadsheets are used well together.

Why is it important to have clean data?

One of the first lessons I learned about databases in college was: "Garbage in, garbage out!" What does that mean? It means that we cannot expect correct results from inaccurate, incomplete, or missing data. The quality of the data we put into a database directly affects the quality of the results we get out of it.

Here are questions and answers for the 4th course in the Google Data Analytics Certification, Process Data from Dirty to Clean. I made them part of the effective learning process that I briefly describe in my article Effective learning — how to remember everything you learn.

These questions were created during the second phase of the SQ3R study technique. Their purpose is to serve as an outline that can and should be expanded as you progress through the course.

What does data integrity imply?

  • accuracy
  • completeness
  • consistency
  • trustworthiness

In what ways can data integrity be compromised?

Data can be compromised during:

  • replication — the process of storing data in multiple locations, which can cause the copies to fall out of sync and become inconsistent
  • transfer — the process of copying data from one storage device to memory or from one computer to another; if the transfer is interrupted, integrity can be compromised and an incomplete dataset loaded
  • manipulation — changing the data to make it more organized and easier to read
  • human error, malware, hacking, and system failures

What are data constraints?

Data constraints are criteria that determine whether data is valid:

  • datatype — date, number, bool…
  • data range — predefined min and max values
  • mandatory — can’t be left blank or empty
  • unique — can’t have duplicates
  • regex — values must match a prescribed pattern
  • cross-field validation — condition for multiple fields
  • primary key — a value that uniquely identifies each row in a table
  • foreign key — values for a column must match values in a column of another table
  • set-membership — values for a column must come from a set of discrete values
  • accuracy — the degree to which the data conforms to the actual entity being measured or described
  • completeness — the degree to which the data contains all desired components or measures
  • consistency — the degree to which the data is repeatable from different points of entry or collection.
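As a sketch, here is how several of these constraints might be declared in SQL when defining a table (the table and column names are hypothetical):

CREATE TABLE orders (
  order_id INT PRIMARY KEY, -- primary key: uniquely identifies each row
  customer_id INT NOT NULL, -- mandatory: can't be left blank
  email VARCHAR(255) UNIQUE, -- unique: duplicates are rejected
  quantity INT CHECK (quantity BETWEEN 1 AND 100), -- data range
  status VARCHAR(10) CHECK (status IN ('open', 'shipped', 'returned')), -- set-membership
  FOREIGN KEY (customer_id) REFERENCES customers (customer_id) -- foreign key
);

Inserts that violate any of these constraints are rejected by the database, which is how constraints protect data integrity at the point of entry.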

What to do when there is no data?

  • gather the data on a small scale to perform a preliminary analysis, then request additional time to complete the analysis after collecting more data
  • if there isn’t time to collect data, perform the analysis on proxy data from another dataset (the most common case)

What to do when there is too little data?

  • do the analysis with proxy data along with the actual data
  • adjust the analysis to align with the data you already have

What to do with wrong data or data errors?

  • if the reason for wrong data is that requirements were misunderstood, communicate the requirements again
  • identify errors and, if possible, correct them at the source by looking for a pattern in the errors
  • if you can’t correct errors, ignore the wrong data and go ahead with the analysis, provided the sample size is large enough and ignoring the errors won’t cause systematic bias

Calculating sample size — terminology

  • population — the entire group
  • sample — a subset of a population
  • margin of error — the amount by which the sample’s results are expected to differ from what the result would have been for the entire population
  • confidence level — how confident you are in a survey result; the confidence level is set before the study begins because it affects the size of the margin of error
  • confidence interval — the range of possible values that the population’s result would take at the study’s confidence level; this range is the sample result ± the margin of error
  • statistical significance — the determination of whether your result could be due to random chance or not

What to remember when determining the size of the sample?

  • don’t use a sample size of less than 30
  • the confidence level most commonly used is 95%, but 90% can work in some cases
  • increase the sample size for a higher confidence level, a smaller margin of error, or greater statistical significance

What task should be completed before analyzing data?

  • determine data integrity by assessing the accuracy, consistency, and completeness of the data
  • connect objectives to the data to understand how business objectives can be served by an investigation into the data
  • know when to stop collecting data

What makes data insufficient?

  • it comes from only one source
  • it is continuously updated and incomplete
  • it is outdated
  • it is geographically limited

How to deal with insufficient data?

  • identify trends within the available data
  • wait for more data if time allows
  • discuss with stakeholders and adjust their objectives
  • search for a new dataset

What is statistical power?

  • the probability of getting meaningful results from a test
  • the larger the sample size, the greater the chance of getting statistically significant results — that’s statistical power
  • statistical power is usually shown as a value out of one; we need a statistical power of at least 0.8 (80%) to consider the results statistically significant
  • statistically significant means that the results of the test are real and not an error caused by random chance

How to determine the best sample size?

Margin of Error Calculator

A sample is a part of the population that is representative of the entire population.

Sample size calculators require input on:

  1. confidence level — the probability that the sample accurately reflects the greater population (95% is most common; 90% can work in some cases)
  2. margin of error — how close the sample’s results are to what the results would be if we used the entire population
  3. population size
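As a sketch of what such a calculator computes, one common formula (Cochran's formula for large populations, assuming maximum variability p = 0.5) can be written directly as a spreadsheet formula, where 1.96 is the z-score for a 95% confidence level and 0.05 is a 5% margin of error:

=ROUNDUP((1.96^2 * 0.5 * 0.5) / 0.05^2, 0)

This returns 385, the familiar minimum sample size for a 95% confidence level with a 5% margin of error.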

What to do with the results?

The calculated sample size is the minimum number needed to achieve the confidence level and margin of error you entered.

What is the most common cause of dirty data?

Human error.

  • typing in a piece of data incorrectly
  • inconsistent formatting
  • blank fields
  • duplicates

Types of dirty data and consequences

  • duplicated data — skewed metrics or analysis
  • outdated data — inaccurate insights, decision-making and analytics
  • incomplete data — decreased productivity, inaccurate insight
  • incorrect/inaccurate data — inaccurate insight and decision making
  • inconsistent data — contradictory data points lead to an inability to classify or segment the data

What is data validation?

Data validation is a tool for checking the accuracy and quality of data before adding or importing it.

What are the principles of data integrity?

  • Validity — the concept of using data integrity principles to ensure measures conform to defined business rules or constraints
  • Accuracy — the degree of conformity of a measure to a standard or a true value
  • Completeness — the degree to which all required measures are known

Data cleaning tools and techniques

Always make a copy of the dataset first!

Remove unwanted data:

  • remove duplicates
  • remove irrelevant data (that doesn’t fit a problem we’re trying to solve)
  • remove extra spaces and blanks
  • fix misspellings, inconsistent capitalization, incorrect punctuation, typos
  • use spellcheck, autocorrect, and conditional formatting, and convert text to lowercase, uppercase, or proper case (see the examples below)
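For example, the case conversions in the last item map directly to built-in functions (the cell reference is hypothetical):

=LOWER(A2) converts the text in A2 to lowercase
=UPPER(A2) converts it to uppercase
=PROPER(A2) converts it to proper case (the first letter of each word capitalized)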

What is a merger?

A merger is an agreement that unites two organizations into a single new one. All the data from each organization would need to be combined using data merging.

What is data merging?

Data merging is the process of combining two or more datasets into a single dataset. Those datasets need to be compatible.

What questions do we need to ask while checking compatibility?

  • do we have all the data we need?
  • do the datasets give us the information to answer our business questions or solve our business problem?
  • does the data we need exist within these datasets?
  • do the datasets need to be cleaned?
  • are datasets cleaned to the same standard?
  • how are missing values handled?
  • how recently was the data updated?

What are common data cleaning mistakes?

  • not checking for spelling errors
  • forgetting to document errors
  • not checking for misfielded values (values entered into the wrong field)
  • overlooking missing values
  • looking at a subset of data and not the whole picture
  • losing track of the business objectives
  • not fixing the source of an error
  • not analyzing the system prior to data cleaning (to figure out where errors come from — data entry, lack of formats, duplicates…)
  • not backing up the data before data cleaning
  • not accounting for data cleaning in deadlines/process

Top 10 tips to clean up data — Google Workspace Learning Center

What are efficiency tools that data analysts use?

  • conditional formatting
  • removing duplicates
  • formatting dates
  • fixing text strings and substrings
  • splitting text to columns (Data > Split text to columns)
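In Google Sheets, the last item is also available as a formula (the cell reference and delimiter are hypothetical):

=SPLIT(A2, ",")

This splits the text in A2 at each comma and spreads the pieces across adjacent columns.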

Excel basic functions for cleaning data

A function is a set of instructions that performs a specific calculation using the data in a spreadsheet.

  • COUNTIF(range, criterion) — returns the number of cells that match specific criteria
  • LEN(text) — returns the length of a string
  • LEFT/RIGHT — give us a set number of characters from the left/right side of a string
  • MID — returns a segment from the middle of a string
  • CONCATENATE — combines two or more text strings
  • TRIM — removes leading, trailing, and repeated spaces
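A few examples of these functions in use (cell references and arguments are hypothetical):

=COUNTIF(A2:A100, ">10") counts the cells in A2:A100 with values greater than 10
=LEN(A2) returns the number of characters in A2
=LEFT(A2, 3) returns the first three characters of A2
=RIGHT(A2, 2) returns the last two characters
=MID(A2, 2, 4) returns four characters starting at the second character
=CONCATENATE(A2, " ", B2) joins A2 and B2 with a space between them
=TRIM(A2) removes leading, trailing, and repeated spaces from A2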

What workflow can be automated?

Workflow automation is the process of automating parts of your work.

  • modeling the data — creating a database structure from diagrams, and creating business-specific infographics, diagrams, data visualizations, and flowcharts

We can partially automate:

  • preparing and cleaning data — some tasks like detecting missing values
  • data exploration — some tasks like visualization

What are different data perspectives we can apply to our dataset?

  • sorting — ordering data so we can easily find duplicates
  • filtering — showing only the data that meets specific criteria
  • pivot table — a data summarization tool for sorting, grouping, counting, totaling, and averaging data
  • VLOOKUP — searches a column for a certain value and returns a value from the matching row (see the example below)
  • plotting — putting data in a graph, chart, or other visual
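A minimal VLOOKUP sketch (the lookup value, range, and column index are hypothetical):

=VLOOKUP("SKU-1042", A2:C100, 3, FALSE)

This searches the first column of A2:C100 for "SKU-1042" and returns the value from the third column of the matching row; FALSE requests an exact match.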

Data cleaning verification checklist:

Sources of errors: Did you use the right tools and functions to find the source of the errors in your dataset?

Null data: Did you search for NULLs using conditional formatting and filters?

Misspelled words: Did you locate all misspellings?

Mistyped numbers: Did you double-check that your numeric data has been entered correctly?

Extra spaces and characters: Did you remove any extra spaces or characters using the TRIM function?

Duplicates: Did you remove duplicates in spreadsheets using the Remove Duplicates function or DISTINCT in SQL? (See the SQL example after this checklist.)

Mismatched data types: Did you check that numeric, date, and string data are typecast correctly?

Messy (inconsistent) strings: Did you make sure that all of your strings are consistent and meaningful?

Messy (inconsistent) date formats: Did you format the dates consistently throughout your dataset?

Misleading variable labels (columns): Did you name your columns meaningfully?

Truncated data: Did you check for truncated or missing data that needs correction?

Business Logic: Did you check that the data makes sense given your knowledge of the business?
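For reference, the SQL approach mentioned in the Duplicates item is simply selecting distinct rows (the table and column names are hypothetical):

SELECT DISTINCT customer_id, order_date
FROM orders;

This returns each unique combination of customer_id and order_date exactly once.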

Top 10 tips to clean up data — Google Workspace Learning Center

What are the steps to review the goal of the project?

  1. confirm the business problem
  2. confirm the goal of the project
  3. verify that data can solve the problem and is aligned with the goal

What is documentation?

Documentation is the process of tracking changes, additions, deletions, and errors during data cleaning. It is staged chronologically and provides a real-time account of every modification.

What are the advantages of documentation?

  • it lets us discover data cleaning errors,
  • it is a way to inform other users of changes that have been made,
  • and it helps us to determine the quality of the data

What is a changelog?

A changelog is a file containing a chronologically ordered list of modifications made to a project:

  • in spreadsheets — File / Version history
  • in SQL — when you commit a query, specify exactly what changed and when, or just add comments as you go while cleaning data

Difference between changelogs and version history

A changelog can build on the automated version history. Version histories record what was done in a data change, but don’t tell us why. Changelogs help us understand the reasons changes have been made.

What type of information should a changelog record?

  • data, file, formula, query, or any other component that changed
  • description of that change
  • date of the change
  • the person who made the change
  • the person who approved the change
  • version number
  • reason for the change

Changelog best practices:

  • changelogs are for humans, so write clearly
  • every version should have its own entry
  • each change should have its own line
  • group the same types of changes together
  • versions should be ordered chronologically, newest first
  • the release date of each version should be noted

How to group categories in changelogs?

All changes usually fall into one of the following categories and should be grouped together:

  • Added — new features introduced
  • Changed — changes to existing functionality
  • Deprecated — features about to be removed
  • Removed — features that have been removed
  • Fixed — bug fixes
  • Security — changes that address vulnerabilities
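A minimal changelog entry grouped by these categories might look like this (the version number, date, and changes are hypothetical):

Version 1.1.0 (2021-06-15)
Changed: standardized all dates in the order_date column to YYYY-MM-DD
Fixed: removed 12 duplicate customer rows introduced during import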

What changes should be captured in the changelog while cleaning the dataset?

  • treated missing data
  • changed formatting
  • changed values or cases for data

Most common errors in data

  • human mistakes — mistyping or misspelling
  • flawed processes — poorly designed processes or survey forms
  • system issues — older systems integrate data incorrectly

How do we import data from one sheet to another?

  1. with the =IMPORTRANGE(spreadsheet_url, range_string) function

IMPORTRANGE — Google Docs Editors Help

  • data is automatically updated
  • more efficient than copying and pasting large sets of data
  • reduces the chance of errors
  • helpful for data cleaning because we can pick out just the data relevant to the project
  • if we want to share data, we need to allow access the first time (see the example below)
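For example (the spreadsheet URL and range are placeholders, not a real document):

=IMPORTRANGE("https://docs.google.com/spreadsheets/d/abc123", "Sheet1!A1:C20")

The first time this formula runs, Google Sheets asks you to allow access to the source spreadsheet.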

2. with the =QUERY(data, "query", [headers]) function

QUERY function — Google Docs Editors Help

  • extract specific data within a spreadsheet
  • faster than filtering manually
  • can be combined with other functions for more complex calculations
  • we can use a simple SQL statement to extract specific data (see the combined example below)
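The two functions can also be combined. Note that when QUERY's data comes from IMPORTRANGE, columns are referenced as Col1, Col2, and so on instead of by letter (the URL and range are placeholders):

=QUERY(IMPORTRANGE("https://docs.google.com/spreadsheets/d/abc123", "Sales!A1:D100"), "SELECT Col1, Col4 WHERE Col2 = 'March'", 1)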

Filtering data with the FILTER function

=FILTER(range, condition1, [condition2, …])

FILTER function — Google Docs Editors Help

  • FILTER function is fully internal to a spreadsheet and doesn’t require the use of query language.
  • it lets us view only data that meet our criteria
  • faster than the QUERY function
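A minimal FILTER sketch (the range and condition are hypothetical):

=FILTER(A2:C100, B2:B100 > 500)

This returns only the rows of A2:C100 where the value in column B is greater than 500.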

I hope you’ll find these study notes helpful, since they cover the most important facts from the course. Using these questions and answers as an outline, you can easily build on and expand your knowledge from these simple answers.

If you find this useful, please clap, share, save, or follow, so we can help other learners too and make their learning process easier. Thanks for reading. :)
