PYDQC: EDA done in one command
Exploratory data analysis (EDA) is the most important step when working with moderate to large amounts of data. Before you can decide what to do with a dataset and how to extract information from it, you need to look at the data and understand its features. The most common EDA steps involve computing summary statistics for columns, removing null values, plotting graphs to check for outliers and understand the range of the data, and deciding whether to remove outliers or replace them. These are the conscious first steps most data analysts and scientists take when presented with data.
This is where pydqc comes in: an open-source tool that automates this process by performing data quality checks and presenting the user with an Excel spreadsheet of the analysis for easier understanding. It generates multiple reports the user can consult to understand the type of each column, the correlations between any two columns, the range of values within a given column, and more. Essentially, this tool does the EDA on your behalf and presents the information in a way that is easy to understand. Here is a flowchart of its application on a sample data frame:
The package can identify four main types of data:
- Key -> Carries no analytic meaning on its own, but serves as a (foreign) key linking other data frames.
- String -> Generally used for categorical data.
- Date
- Numeric -> Generally used for regression-based analysis.
It gives a whole range of information about each of these columns:
- Key and String — It gives sample values, unique values, null values, and the top 10 values.
- Date — It gives sample values, null values, unique values, minimum and maximum dates, median values, and the graph distribution.
- Numeric — It gives sample values, range, unique values, mean, median, and the graph distribution.
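To make the per-column outputs above concrete, here is a minimal sketch, using only the Python standard library, of the kind of statistics pydqc reports for numeric and key/string columns. The helper names (`summarize_numeric`, `summarize_string`) are illustrative, not pydqc's API, and the graph distribution is omitted.

```python
# Illustrative sketch of pydqc-style per-column summaries (not pydqc's API).
from statistics import mean, median

def summarize_numeric(values):
    """Numeric columns: sample values, range, unique count, nulls, mean, median."""
    present = [v for v in values if v is not None]
    return {
        "sample": present[:3],
        "range": (min(present), max(present)),
        "n_unique": len(set(present)),
        "n_null": len(values) - len(present),
        "mean": mean(present),
        "median": median(present),
    }

def summarize_string(values):
    """Key/string columns: sample values, unique count, nulls, top-10 values."""
    present = [v for v in values if v is not None]
    counts = {}
    for v in present:
        counts[v] = counts.get(v, 0) + 1
    top10 = sorted(counts.items(), key=lambda kv: -kv[1])[:10]
    return {
        "sample": present[:3],
        "n_unique": len(counts),
        "n_null": len(values) - len(present),
        "top_values": top10,
    }

print(summarize_numeric([4.0, 5.0, None, 3.0, 4.0]))
print(summarize_string(["drama", "comedy", "drama", None]))
```

pydqc computes this for every column at once and writes the result into a spreadsheet, which is what saves the many individual commands you would otherwise run by hand.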
To get this information from a Python data frame normally, we would need to run at least 20 different commands; with pydqc, you can condense your entire EDA into one command.
For our use case, a movie streaming scenario, pydqc has proved very useful for EDA. It not only simplified the EDA process to a single line of code, but also gave useful insights into the different attributes of the streaming data.
We tried two commands, infer_schema and data_summary, on the pre-processed movie streaming data, which was ready for EDA. The output of each command can be seen below:
infer_schema
data_summary (Summary Tab)
data_summary (Rating)
data_summary (Movie)
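To give a feel for what the infer_schema step does, here is a toy sketch of the idea, guessing one of pydqc's four types for each column. This is an illustration in plain Python, not pydqc's actual implementation or API; the column names mimic our streaming data.

```python
# Toy illustration of schema inference (not pydqc's implementation).
from datetime import datetime

def infer_column_type(name, values):
    """Guess one of the four pydqc types: key, str, date, numeric."""
    present = [v for v in values if v is not None]
    if all(isinstance(v, (int, float)) for v in present):
        # A unique, id-like integer column is more likely a key than a measurement.
        if name.endswith("_id") and len(set(present)) == len(present):
            return "key"
        return "numeric"
    try:
        for v in present:
            datetime.strptime(v, "%Y-%m-%d")
        return "date"
    except (TypeError, ValueError):
        return "str"

table = {
    "user_id": [1, 2, 3],
    "rating": [4.0, 5.0, 3.0],
    "watched_on": ["2020-01-01", "2020-01-02", "2020-01-03"],
    "genre": ["drama", "comedy", "drama"],
}
schema = {col: infer_column_type(col, vals) for col, vals in table.items()}
print(schema)
# → {'user_id': 'key', 'rating': 'numeric', 'watched_on': 'date', 'genre': 'str'}
```

pydqc additionally writes the inferred schema to a spreadsheet so you can correct any guesses by hand before running data_summary.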
Thus, just by looking at the information generated by the command, we can infer the total number of movies and the total number of users, and, had the data been collected differently, we could even gauge the average time a user spends watching a given movie. All of this information is very useful when building a predictive model, and presenting it in such an easy-to-understand way makes deciding on the next steps toward a model much easier.
Additionally, this package has two other features: data_compare and data_consist.
The data_compare command performs statistical analysis on columns in two different tables that can be joined on a common key. This is very useful when comparing the training set and the test set: we can visualize the attributes of both sets with a single command and then judge how well the split was made. This helps guard against overfitting or underfitting and ensures that the data is evenly distributed across both sets.
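The idea behind data_compare can be sketched as follows: compute the same summary statistic for a column in each of two tables (say, train vs. test) and put them side by side so a skewed split is easy to spot. The helper name and output fields here are hypothetical, not pydqc's API.

```python
# Hedged sketch of the data_compare idea (names are illustrative).
from statistics import mean

def compare_column(train_vals, test_vals):
    """Side-by-side stats for one column in two splits, plus the mean gap."""
    t = [v for v in train_vals if v is not None]
    s = [v for v in test_vals if v is not None]
    return {
        "train": {"mean": mean(t), "min": min(t), "max": max(t)},
        "test": {"mean": mean(s), "min": min(s), "max": max(s)},
        "mean_gap": abs(mean(t) - mean(s)),
    }

train_ratings = [4.0, 3.5, 5.0, 4.5]
test_ratings = [4.0, 4.5, 3.5]
report = compare_column(train_ratings, test_ratings)
print(report["mean_gap"])  # a large gap suggests an uneven split
```

pydqc runs this kind of comparison across every shared column at once and renders the results (including distribution plots) into a spreadsheet.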
The data_consist command, despite its somewhat misleading name, checks the consistency of data across two different tables, i.e., whether the two tables agree on their data points. This is very useful when two tables should contain the same data but come from different sources, for example, movie streaming data from Netflix and Hulu. If we want to perform the same ML analysis on both, we need to ensure that the final table is consistent across the two sources. Using this tool, we can quickly spot any mistakes in our data gathering and rectify them immediately.
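A rough sketch of the data_consist idea: join the two tables on a shared key and flag rows whose values disagree. The function and field names below are hypothetical stand-ins, chosen to echo the streaming example, not pydqc's actual interface.

```python
# Rough sketch of the data_consist idea (helper names are hypothetical).
def find_inconsistent(table_a, table_b, key, field):
    """Return (key, value_a, value_b) for rows that share a key but disagree."""
    b_by_key = {row[key]: row for row in table_b}
    mismatches = []
    for row in table_a:
        other = b_by_key.get(row[key])
        if other is not None and row[field] != other[field]:
            mismatches.append((row[key], row[field], other[field]))
    return mismatches

source_a = [{"movie_id": 1, "title": "Heat"}, {"movie_id": 2, "title": "Alien"}]
source_b = [{"movie_id": 1, "title": "Heat"}, {"movie_id": 2, "title": "Aliens"}]
print(find_inconsistent(source_a, source_b, "movie_id", "title"))
# → [(2, 'Alien', 'Aliens')]
```

In practice pydqc reports this per column with match rates, which is what makes mistakes in data gathering jump out quickly.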
We can use both of these commands in the future to detect data irregularities between the training set and the production or test set. This can help us determine data drift, i.e., unanticipated changes in the data that might lead the model to produce undesired results.