Why is it important to have clean data?
One of the first lessons I learned about databases in college was: “Garbage in — garbage out!”. What does that mean? It means that we cannot expect correct results from inaccurate, incomplete, or missing data. The quality of the data we put into a database directly affects the quality of the results we get out of it.
Here are questions and answers for the 4th course in the Google Data Analytics Certification — Process Data from Dirty to Clean. I made them a part of the effective learning process which I briefly describe in my article Effective learning — how to remember everything you learn. These questions were written during the second phase of the SQ3R study technique. Their purpose is to serve as a scaffold that can and should be expanded as you progress through the course.
What does data integrity imply?
Data integrity is the accuracy, completeness, consistency, and trustworthiness of data throughout its life cycle.
In what ways can data integrity be compromised?
Data can be compromised any time it is:
● replicated (copies stored in different locations can fall out of sync),
● transferred (a failed or interrupted transfer can leave data incomplete),
● manipulated (errors can be introduced while the data is being processed),
as well as through human error, viruses, malware, hacking, and system failures.
What are data constraints?
Data constraints are the criteria that determine whether data is valid, such as the data type, the allowed range, mandatory (non-null) fields, uniqueness, membership in a fixed set, and patterns expressed as regular expressions. A sketch of how some of these can be enforced follows below.
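As an illustration, here is a minimal SQL sketch (the table and column names are hypothetical, not from the course) that declares several of these constraints so that the database rejects invalid rows:

-- Hypothetical table declaring common data constraints
CREATE TABLE customers (
  customer_id INTEGER PRIMARY KEY,                            -- unique and mandatory
  email       TEXT NOT NULL UNIQUE,                           -- mandatory, no duplicates
  age         INTEGER CHECK (age BETWEEN 0 AND 120),          -- allowed range
  status      TEXT CHECK (status IN ('active', 'inactive'))   -- fixed set of values
);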
What to do when there is no data?
● Gather data on a small scale to perform a preliminary analysis, and then request additional time to complete the analysis after collecting more data.
● If there isn't time to collect data, perform the analysis on proxy data from another dataset (the most common case).
What to do when there is too little data?
● Do the analysis using proxy data along with the actual data.
● Adjust the analysis to align with the data you already have.
What to do with wrong data / data errors?
○ If the reason for the wrong data is that requirements were misunderstood, communicate the requirements again.
○ Identify the errors and, if possible, correct them at the source by looking for a pattern in the errors.
○ If you can't correct the errors, ignore the wrong data and go ahead with the analysis, provided the sample size is large enough and ignoring the errors won't cause systematic bias.
Calculating sample size — terminology
● Population: the entire group you are interested in.
● Sample: the subset of the population you actually collect data from.
● Margin of error: the maximum amount the sample result is expected to differ from the result for the whole population.
● Confidence level: how confident you are that the sample reflects the population, e.g. 95%.
● Confidence interval: the range of possible values, built from the sample result and the margin of error, that the population result would fall within at the given confidence level.
● Statistical significance: how likely it is that a result is not due to random chance.
What to remember when determining the size of the sample?
Increase the sample size when you want:
— a higher confidence level,
— a smaller margin of error,
— greater statistical significance.
What tasks should be completed before analyzing data?
● Determine data integrity by assessing the accuracy, consistency, and completeness of the data.
● Connect objectives to the data to understand how the business objectives can be served by an investigation into the data.
● Know when to stop collecting data.
What makes data insufficient?
● data from only one source,
● data that keeps updating and is still incomplete,
● outdated data,
● geographically limited data.
How to deal with insufficient data?
● identify trends within the data you do have,
● wait for more data if time allows,
● talk with stakeholders and adjust your objective,
● look for a new dataset.
What is statistical power?
● Statistical power is the probability of getting meaningful results from a test.
● The larger the sample size, the greater the chance of statistically significant results — that's statistical power.
● Statistical power is usually shown as a value out of one. We need a statistical power of at least 0.8 (80%) to consider the results statistically significant.
● Statistically significant means that the results of the test are real and not an error caused by random chance.
How to determine the best sample size?
Use a sample size calculator, such as a Margin of Error Calculator.
Sample size calculators require input on:
● confidence level,
● population size,
● margin of error.
What to do with the results?
The calculated sample size is the minimum number of responses needed to achieve what you entered for the confidence level and margin of error. A worked example of the underlying formula follows below.
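For intuition, this is the standard formula most such calculators use for large populations (my addition, not from the course):

n = z^2 * p * (1 - p) / E^2

where z is the z-score for the chosen confidence level (1.96 for 95%), p is the expected proportion (0.5 is the conservative default), and E is the margin of error. For a 95% confidence level and a 5% margin of error: n = 1.96^2 * 0.25 / 0.05^2 ≈ 385 respondents.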
Types of dirty data and their consequences:
● duplicate data: skewed metrics and inflated counts,
● outdated data: inaccurate insights and decisions,
● incomplete data: decreased productivity and inaccurate insights,
● incorrect/inaccurate data: inaccurate insights and potential revenue loss,
● inconsistent data: contradictory data points and less reliable conclusions.
What is data validation?
Data validation is a tool for checking the accuracy and quality of data before adding or importing it (see the SQL sketch after the next list).
What are the principles of data integrity?
● validity,
● accuracy,
● completeness,
● consistency.
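As a sketch of data validation in SQL (the table and column names are hypothetical), a query like this flags rows that break the expected criteria before they are imported:

-- Flag rows that fail the expected validity criteria
SELECT *
FROM staging_orders
WHERE order_date IS NULL                                -- mandatory field missing
   OR quantity <= 0                                     -- value out of range
   OR status NOT IN ('new', 'shipped', 'cancelled');    -- value outside the allowed set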
Data cleaning tools and techniques
Always make a copy of the dataset first! Then:
● remove unwanted data: duplicates and irrelevant observations,
● clean up text: extra spaces and blanks, misspellings, inconsistent capitalization, incorrect punctuation and other typos,
● remove formatting that clashes when data is pulled from multiple sources.
What is a merger?
A merger is an agreement that unites two organizations into a single new one. All the data from each organization would need to be combined using data merging.
What is data merging?
Data merging is the process of combining two or more datasets into a single dataset. Those datasets need to be compatible; a minimal sketch follows after the next list.
What questions do we need to ask while checking compatibility?
● Do I have all the data I need?
● Do the datasets use the same structure, naming conventions, and formats?
● Do the same fields mean the same thing in each dataset?
● Are there duplicates across the datasets?
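A minimal SQL sketch of data merging (hypothetical table and column names), assuming the compatibility questions above have been answered:

-- Combine two compatible customer tables into one dataset;
-- UNION (unlike UNION ALL) also drops exact duplicates
SELECT customer_id, customer_name, region FROM customers_company_a
UNION
SELECT customer_id, customer_name, region FROM customers_company_b;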
What are common data cleaning mistakes?
● not checking for spelling errors,
● forgetting to document errors,
● not checking for misfielded values (values placed in the wrong field),
● overlooking missing values,
● looking at only a subset of the data,
● losing track of the business objectives,
● not fixing the source of the error,
● not backing up the original dataset before cleaning,
● not accounting for data cleaning in your deadlines.
Top 10 tips to clean up data — Google Workspace Learning Center
What are efficiency tools that data analysts use?
Excel basic functions for cleaning data:
● COUNTIF counts how many cells in a range match a value or condition, which helps spot outliers,
● LEN returns the length of a text string, which helps validate fixed-length IDs,
● LEFT and RIGHT return a set number of characters from the start or end of a string,
● MID returns a segment from the middle of a string,
● CONCATENATE joins text from several cells together,
● TRIM removes leading, trailing, and repeated spaces.
A few example formulas follow below.
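Some illustrative formulas (the cell ranges are hypothetical):
● =COUNTIF(A2:A100, ">100") counts suspiciously large values,
● =LEN(B2) checks that an ID has the expected number of characters,
● =TRIM(C2) strips extra spaces from a text cell,
● =CONCATENATE(LEFT(D2, 3), "-", RIGHT(D2, 4)) rebuilds a formatted code from its parts.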
What workflows can be automated?
Tasks like modeling data can be largely automated, while tasks that rely on human judgment, such as communicating with stakeholders and presenting findings, cannot.
We can partially automate:
● detecting and correcting errors (e.g., with data validation rules),
● preparing and cleaning data.
What are different data perspectives we can apply to our dataset?
● sorting and filtering,
● pivot tables,
● the VLOOKUP function (see the example below),
● plotting the data to find outliers.
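For example (hypothetical ranges), this formula looks up the value of A2 in the first column of Sheet2!$A$2:$B$100 and returns the matching value from the second column, with FALSE requiring an exact match:

=VLOOKUP(A2, Sheet2!$A$2:$B$100, 2, FALSE)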
Data cleaning verification checklist:
● Sources of errors: Did you use the right tools and functions to find the source of the errors in your dataset?
● Null data: Did you search for NULLs using conditional formatting and filters?
● Misspelled words: Did you locate all misspellings?
● Mistyped numbers: Did you double-check that your numeric data has been entered correctly?
● Extra spaces and characters: Did you remove any extra spaces or characters using the TRIM function?
● Duplicates: Did you remove duplicates in spreadsheets using the Remove Duplicates function or DISTINCT in SQL?
● Mismatched data types: Did you check that numeric, date, and string data are typecast correctly?
● Messy (inconsistent) strings: Did you make sure that all of your strings are consistent and meaningful?
● Messy (inconsistent) date formats: Did you format the dates consistently throughout your dataset?
● Misleading variable labels (columns): Did you name your columns meaningfully?
● Truncated data: Did you check for truncated or missing data that needs correction?
● Business logic: Did you check that the data makes sense given your knowledge of the business?
Several of these checks expressed in SQL are sketched below.
What are the steps to review the goal of the project?
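As a sketch, here is how a few of the checklist items look in SQL (the table sales_data and its columns are hypothetical; the CAST type follows BigQuery, which the course uses):

-- Null data: find NULLs in a mandatory column
SELECT * FROM sales_data WHERE customer_name IS NULL;

-- Duplicates: drop repeated rows while selecting
SELECT DISTINCT customer_name, region FROM sales_data;

-- Extra spaces and mismatched types: trim text and typecast in one pass
SELECT TRIM(customer_name) AS customer_name,
       CAST(order_total AS FLOAT64) AS order_total
FROM sales_data;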
What is documentation?
Documentation is the process of tracking the changes, additions, deletions, and errors involved in a data cleaning effort.
What are the advantages of documentation?
● it lets you recover from data cleaning errors,
● it informs other users of the changes you made,
● it helps you determine the quality of the data used in the analysis.
What is a changelog?
A changelog is a file containing a chronologically ordered list of the modifications made to a project.
Difference between changelogs and version history
A version history (like the one built into Google Sheets) automatically records what was changed and by whom; a changelog is written by the analyst and records why each change was made.
What type of information should a changelog record?
● the data, file, formula, query, or any other component that changed,
● a description of what changed,
● the date of the change,
● the person who made the change,
● the person who approved the change,
● a version number,
● the reason for the change.
Changelog best practices:
● changelogs are for humans, not machines, so write legibly,
● every version should have its own entry,
● group the same types of changes together,
● list versions in reverse chronological order (newest first),
● note the release date of each version.
How to group changes into categories in changelogs?
● Added: new features,
● Changed: changes to existing functionality,
● Deprecated: features that will soon be removed,
● Removed: features that have been removed,
● Fixed: bug fixes,
● Security: fixes for vulnerabilities.
What changes should be captured in the changelog while cleaning the dataset?
Record every error you found, what you did to fix it, and how many records were affected, so that the cleaning can be reviewed, reproduced, or rolled back. An illustrative entry follows below.
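An example of what such an entry might look like (all details are hypothetical):

Version 1.1 (2021-06-10)
Changed:
- Standardized region names in column C to two-letter codes (142 rows).
Fixed:
- Removed 12 duplicate customer records introduced by the June import.
- Re-pulled 7 NULL order totals from the source system.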
Most common errors in data:
● misspellings and typos,
● inconsistent formats and labels,
● duplicate entries,
● missing (null) values,
● numbers stored as text.
How do we import data from one sheet to another?
1. with the =IMPORTRANGE(spreadsheet_url, range_string) function: IMPORTRANGE — Google Docs Editors Help
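For example, this formula (the spreadsheet URL is hypothetical) pulls a range from another spreadsheet; Sheets asks you to grant access the first time it runs:

=IMPORTRANGE("https://docs.google.com/spreadsheets/d/abc123", "Sheet1!A1:C10")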
2. with the =QUERY(data, "query", [headers]) function: QUERY function — Google Docs Editors Help
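For example (hypothetical range and columns), this returns the names in column A and the totals in column C for rows where the total exceeds 100, treating the first row as a header:

=QUERY(Sheet1!A1:C100, "SELECT A, C WHERE C > 100", 1)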
Filtering data with the FILTER function
=FILTER(range, condition1, [condition2, …])
FILTER function — Google Docs Editors Help
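For example (hypothetical ranges), this keeps only the rows of A2:C100 whose value in column B is greater than 50:

=FILTER(A2:C100, B2:B100 > 50)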
I hope you'll find these study notes helpful, since they gather the most important facts from the course. Using these questions and answers as a scaffold, you can easily build and expand your knowledge on top of these simple answers. If you found this useful, please clap, share, save, or follow, so we can help other learners too and make their learning process easier. Thanks for reading. :)