Data Engineering Interview Questions

shorya sharma
Sep 23, 2023 · 5 min read

In this blog, we’ll delve into some common data engineering questions and solutions, showcasing the techniques and best practices that enable us to harness the power of data effectively. From handling complex file formats to implementing Slowly Changing Dimension (SCD) Type 2 in PySpark, we’ll explore a range of scenarios that data engineers frequently encounter, shedding light on how to overcome these challenges with practical code examples.

Question 1: How would you handle multiple Spark transformations that need to be executed in parallel?

Multiple Spark transformations can be executed in parallel by relying on Spark's built-in parallel processing. Spark already parallelizes operations on distributed data, but you can take additional steps to optimize and control that parallelism for your specific use case.

Parallelism Configuration: Spark provides various configuration options for controlling parallelism:

  • spark.default.parallelism: Sets the default number of partitions for RDDs created through transformations.
  • spark.sql.shuffle.partitions: Configures the number of partitions for shuffling operations (e.g., joins, groupBy).
  • spark.executor.cores: Specifies the number of CPU cores allocated to each executor. Adjust this according to your cluster resources.
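
As a quick illustration, these settings can be supplied when the SparkSession is created. The values below are placeholders and should be tuned to your cluster's cores and data volume:

from pyspark.sql import SparkSession

# Illustrative values only: tune to the cluster's cores and data volume
spark = (
    SparkSession.builder
    .appName("ParallelTransformations")
    .config("spark.default.parallelism", "200")     # default partition count for RDD transformations
    .config("spark.sql.shuffle.partitions", "200")  # partitions used by shuffles (joins, groupBy)
    .config("spark.executor.cores", "4")            # CPU cores allocated to each executor
    .getOrCreate()
)

Beyond configuration, remember that transformations are lazy: independent jobs only run side by side when their actions are submitted concurrently, for example from separate driver threads, ideally with spark.scheduler.mode set to FAIR so that no single job starves the others.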

Question 2: If you have 10 input files but only 2 are critical and need immediate processing, how would you configure and orchestrate Spark jobs for the remaining 8 files?

To configure and orchestrate Spark jobs for processing a mix of critical (2) and non-critical (8) input files, you can use job prioritization and queueing mechanisms.

The cluster manager exposes job scheduling queues, and you choose a queue when submitting a Spark application. In this case you want to prioritize the jobs for the critical files: submit high-priority jobs for the critical files and lower-priority jobs for the non-critical ones.

# Submit high-priority job for critical files
spark-submit --queue critical_queue your_high_priority_job.py

# Submit lower-priority job for non-critical files
spark-submit --queue default_queue your_low_priority_job.py

You need to configure your Spark cluster with multiple queues (e.g., “critical_queue” and “default_queue”) and set different priorities for these queues in the cluster manager (e.g., YARN or Mesos). The critical_queue should be configured with a higher priority.
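
If both workloads run inside a single Spark application instead of two separate submissions, FAIR scheduler pools give a similar effect. The sketch below is an assumption-laden illustration: the pool names, file paths, and output locations are made up, and the "critical" pool's higher weight would normally be defined in a fairscheduler.xml referenced by spark.scheduler.allocation.file.

from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("PrioritizedFiles")
    .config("spark.scheduler.mode", "FAIR")  # let jobs from different pools share executors
    .getOrCreate()
)

def process(paths, pool_name):
    # The scheduler pool is a thread-local property, so each worker thread sets its own
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", pool_name)
    for path in paths:
        (spark.read.csv(path, header=True)
              .write.mode("overwrite")
              .parquet(path + "_out"))

critical_files = ["critical_1.csv", "critical_2.csv"]      # hypothetical paths
remaining_files = [f"file_{i}.csv" for i in range(1, 9)]   # hypothetical paths

# Kick off both sets concurrently; the critical pool gets priority once its weight is configured
with ThreadPoolExecutor(max_workers=2) as executor:
    executor.submit(process, critical_files, "critical")
    executor.submit(process, remaining_files, "default")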

Question 3: Let's say we have a DataFrame with two columns, revenue and date. How would you calculate the month-wise cumulative revenue using PySpark?

+-------+------+
|revenue| date|
+-------+------+
| 566.92|13-Jul|
|1842.57|23-May|
|1963.64|07-Nov|
|1501.06|02-Jan|
| 586.19|16-Apr|
|1565.62|04-Dec|
|1946.01|25-Feb|
| 430.92|23-May|
|1762.45|01-May|
| 750.5|03-Jan|
+-------+------+

from pyspark.sql.window import Window
from pyspark.sql.functions import col, to_date, sum, month, expr

# Derive a numeric month from the "date" string (format "dd-MMM")
df = df.withColumn("months", month(to_date(col("date"), "dd-MMM")))

# Window partitioned and ordered by month; with the default RANGE frame,
# every row in a month sees that month's full revenue
window_spec = Window.partitionBy(df["months"]).orderBy(df["months"])

# Sum the revenue over each month's window
df = df.withColumn("cumulative_revenue", sum(col("revenue")).over(window_spec))

# Keep the month abbreviation (e.g., "Jul") and one row per month
df = (df.withColumn("month", expr("substr(date, 4, 3)"))
        .select("month", "cumulative_revenue")
        .distinct())
df.show()

+-----+------------------+
|month|cumulative_revenue|
+-----+------------------+
| Jul| 566.92|
| May|4035.9399999999996|
| Nov| 1963.64|
| Feb| 1946.01|
| Dec| 1565.62|
| Apr| 586.19|
| Jan| 2251.56|
+-----+------------------+

Question 4: Schema copy in SQL: how do you copy only the schema from one table to another table without copying the data?

CREATE TABLE new_table_name
AS
SELECT *
FROM existing_table_name
WHERE 1 = 0; -- Ensures no rows are copied

Question 5: Handling a text file with multiple delimiters in PySpark

John|Doe|25
Jane Smith,30
Bob Johnson|22

If the delimiter varies from line to line but you want to treat a comma (,) as the canonical delimiter, you can preprocess each line in PySpark so that every delimiter becomes a comma, then split on commas. PySpark's built-in regexp_replace and split functions handle both steps.

from pyspark.sql import SparkSession
from pyspark.sql.functions import split, regexp_replace, col

# Create a SparkSession
spark = SparkSession.builder.appName("DynamicDelimiterText").getOrCreate()

# Define the file path
file_path = "data.txt"

# Read the text file into a DataFrame
df = spark.read.text(file_path)

# Define a regular expression pattern to match all non-alphanumeric characters except a comma
# This will help identify and replace delimiters with a comma
delimiter_pattern = "[^a-zA-Z0-9,]+"

# Replace any non-comma delimiter with a comma
df = df.withColumn("value", regexp_replace(col("value"), delimiter_pattern, ","))

# Split each line using a comma as the delimiter
df = df.select(split(col("value"), ",").alias("fields"))

# Select the individual columns from the split result
df = df.selectExpr("fields[0] as first_name", "fields[1] as last_name", "cast(fields[2] as int) as age")

# Show the DataFrame
df.show()

# Stop the SparkSession
spark.stop()

Question 6: What is SCD Type 2?

SCD Type 2, or Slowly Changing Dimension Type 2, is a common technique used in data warehousing and data modeling to manage historical changes in dimension data over time. It is part of the Kimball methodology for designing data warehouses and is used to maintain historical context and historical versions of dimension records. SCD Type 2 is particularly useful when dealing with slowly changing attributes of entities, such as customer names, addresses, product descriptions, and employee job titles.

Key characteristics of SCD Type 2 include:

1. Historical Tracking: SCD Type 2 tracks changes to dimension attributes over time, preserving historical versions of the data. This allows for historical reporting and analysis.

2. New Records for Changes: When an attribute value changes, a new record is added to the dimension table with the updated value. The existing record is marked as “expired” or “inactive.”

3. Surrogate Keys: A surrogate key is typically used as the primary key of the dimension table to uniquely identify each version of a dimension record.

4. Effective Dates: SCD Type 2 records include effective start and end dates to indicate the time period during which each version of the record is valid.

5. Current Flag: An additional column, often called a “current flag” or “current indicator,” is used to mark the most current version of a record.

6. Queries with Historical Context: Users can perform historical queries to retrieve data as it existed at specific points in time.

SCD Type 2 is often implemented using dimension tables in a data warehouse, where each dimension record corresponds to a unique combination of attributes. When changes occur to dimension attributes, a new record is inserted into the dimension table with a new surrogate key and updated effective dates. The previous version of the record is marked as “expired” or “inactive” by updating its end date.

For example, consider a customer dimension table. If a customer changes their address, SCD Type 2 would insert a new record with the updated address and update the effective end date of the previous record to indicate when it became inactive. This allows historical analysis of customer addresses.
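
To tie this back to PySpark, here is a minimal, illustrative sketch of the expire-and-insert pattern on a toy customer dimension. The table layout, column names, and surrogate-key approach are assumptions made for the example; production pipelines often express the same logic as a MERGE against a lakehouse table.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("SCDType2Sketch").getOrCreate()

# Existing dimension: one row per customer version; "9999-12-31" marks the open-ended current version
dim = spark.createDataFrame(
    [(1, "C001", "12 Oak St", "2020-01-01", "9999-12-31", True),
     (2, "C002", "7 Elm Rd",  "2021-06-15", "9999-12-31", True)],
    ["sk", "customer_id", "address", "start_date", "end_date", "is_current"],
)

# Incoming snapshot: customer C001 has moved
updates = spark.createDataFrame([("C001", "99 Pine Ave")], ["customer_id", "new_address"])
load_date = "2023-09-23"

# Current rows whose tracked attribute changed in the incoming snapshot
changed = (dim.filter("is_current")
              .join(updates, "customer_id")
              .filter(F.col("address") != F.col("new_address")))

# 1. Expire the old versions: close the end_date and clear the current flag
expired = (changed.drop("new_address")
                  .withColumn("end_date", F.lit(load_date))
                  .withColumn("is_current", F.lit(False)))

# 2. Insert new versions with a fresh surrogate key and an open-ended end_date
max_sk = dim.agg(F.max("sk")).first()[0]
new_versions = changed.select(
    (F.monotonically_increasing_id() + F.lit(max_sk + 1)).alias("sk"),  # sketch-only surrogate key
    "customer_id",
    F.col("new_address").alias("address"),
    F.lit(load_date).alias("start_date"),
    F.lit("9999-12-31").alias("end_date"),
    F.lit(True).alias("is_current"),
)

# 3. Everything that was not expired stays as-is; rebuild the dimension
untouched = dim.join(changed.select("sk"), "sk", "left_anti")
new_dim = untouched.unionByName(expired).unionByName(new_versions)
new_dim.orderBy("customer_id", "start_date").show()

Running this leaves C002 untouched, closes the old C001 row with an end_date of 2023-09-23, and adds a new current C001 row carrying the updated address.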

SCD Type 2 is one of several types of slowly changing dimensions, each designed to handle different scenarios of changing data. Other common types include SCD Type 1 (overwrite existing data), SCD Type 3 (maintain limited history), and SCD Type 4 (store historical changes in a separate table). The choice of which SCD type to use depends on the specific requirements of the data warehouse and the nature of the dimension data changes.
