Meesho PySpark Interview Questions for Data Engineers in 2025
Preparing for a PySpark interview? Let’s tackle some commonly asked questions, along with practical answers and insights to ace your next Data Engineering interview at Meesho or any top-tier tech company.
1. Explain how caching and persistence work in PySpark. When would you use cache() versus persist() and what are their performance implications?
Answer:
Caching: Stores data in memory (the default storage level) for faster retrieval.
- Use cache() when you need to reuse a DataFrame or RDD multiple times in a session without specifying a storage level.
Persistence: Allows you to specify a storage level (e.g., memory, disk, or a combination).
- Use persist() when memory is limited and you want a fallback to disk storage.
Performance Implications:
- cache() is convenient and fast for in-memory reuse, but it gives you no control over the storage level when memory is tight.
- persist() is more flexible but may introduce disk I/O overhead if data doesn't fit in memory.
A short sketch of both options follows below.
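A minimal sketch of both options, assuming a hypothetical orders dataset (the path and column names are placeholders):

```python
from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.appName("caching-demo").getOrCreate()

# Hypothetical orders dataset with status and seller_id columns
df = spark.read.parquet("/data/orders")

# cache(): no storage-level argument; the default level is used
delivered = df.filter(df.status == "delivered").cache()
delivered.count()   # the first action materializes the cache

# persist(): choose an explicit level, e.g. spill to disk when memory is tight
per_seller = df.groupBy("seller_id").count().persist(StorageLevel.MEMORY_AND_DISK)
per_seller.count()

# Release the cached data once it is no longer reused
delivered.unpersist()
per_seller.unpersist()
```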
2. Writing data to distributed storage like HDFS or Azure Data Lake
Answer:
To optimize write operations:
- Too many small output files: Use coalesce() to reduce the number of output partitions (and therefore files) without a full shuffle.
- Large outputs: Use repartition() to increase write parallelism (see the sketch after the tips below).
Key Optimization Tips:
- Use efficient formats like Parquet or ORC.
- Enable compression (e.g., Snappy).
- Avoid writing too many small files by tuning partitions.
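A hedged sketch of both write patterns; the paths, partition counts, and the event_date column are illustrative assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("write-demo").getOrCreate()
df = spark.read.parquet("/data/events")   # hypothetical input

# Fewer, larger output files: coalesce avoids a full shuffle
(df.coalesce(8)
   .write.mode("overwrite")
   .option("compression", "snappy")
   .parquet("hdfs:///warehouse/events_compacted"))

# Large write with more parallelism, partitioned by a date column
(df.repartition(200, "event_date")
   .write.mode("overwrite")
   .partitionBy("event_date")
   .parquet("abfss://container@account.dfs.core.windows.net/events"))
```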
3. How does PySpark handle job execution, and what is the role of the DAG (Directed Acyclic Graph)?
Answer:
- DAG: Represents the sequence of transformations in a PySpark job. It breaks the computation into stages and tasks.
- Stages and Tasks:
- Stages are based on shuffle boundaries.
- Tasks are the smallest units of execution and run on individual partitions.
Execution Flow:
- PySpark builds the DAG based on transformations.
- The DAG Scheduler divides it into stages.
- Tasks within a stage are executed in parallel.
Example: For a simple read → filter → show pipeline, PySpark creates a DAG with read, filter, and show as nodes, breaking it into stages if shuffles are needed (see the sketch below).
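A minimal sketch of such a pipeline; the path and column are assumed for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dag-demo").getOrCreate()

df = spark.read.parquet("/data/orders")            # hypothetical path
delivered = df.filter(df.status == "delivered")    # narrow transformation, no shuffle

delivered.show(5)      # the action triggers DAG construction and execution
delivered.explain()    # inspect the plan Spark built for this DAG
```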
4. PySpark code to find the top N most frequent items in a dataset
Answer:
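One straightforward way to do this, assuming the items live in a column named item_id of a hypothetical orders dataset:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("top-n-demo").getOrCreate()
orders = spark.read.parquet("/data/orders")   # hypothetical dataset with an item_id column

N = 10
top_items = (orders
             .groupBy("item_id")                  # count occurrences per item
             .agg(F.count("*").alias("cnt"))
             .orderBy(F.desc("cnt"))              # most frequent first
             .limit(N))

top_items.show()
```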
Optimization Tips:
- Use columnar storage formats like Parquet.
- Ensure partitions are balanced.
5. How does partitioning affect performance, and how to determine optimal partitions?
Answer:
Impact of Partitioning:
- Too few partitions → Underutilized cluster resources.
- Too many partitions → Overhead in task scheduling.
Determine Optimal Partitions:
- Use the formula: number of partitions ≈ total data size / target partition size.
- Monitor partition size: aim for roughly 128 MB to 1 GB per partition (a sizing sketch follows below).
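A sketch of the sizing calculation, assuming a roughly 50 GB dataset and a 256 MB target partition size (both numbers are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioning-demo").getOrCreate()

total_size_mb = 50 * 1024        # assumed total dataset size (~50 GB)
target_partition_mb = 256        # target size per partition
num_partitions = total_size_mb // target_partition_mb   # 200 partitions

df = spark.read.parquet("/data/clickstream")   # hypothetical path
df = df.repartition(num_partitions)

print(df.rdd.getNumPartitions())   # should report 200
```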
6. Narrow vs. Wide Transformations
Answer:
- Narrow Transformations: Data stays within the same partition; no shuffle is required. Examples: map(), filter().
- Wide Transformations: Data is redistributed across partitions, requiring a shuffle. Examples: groupBy(), join().
Impact: Wide transformations are more expensive due to network and disk I/O during shuffles.
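A small sketch contrasting the two on a hypothetical orders dataset; the Exchange node in the plan marks the shuffle introduced by the wide transformation:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("transformations-demo").getOrCreate()
orders = spark.read.parquet("/data/orders")   # hypothetical dataset

# Narrow: filter and withColumn operate partition by partition, no data movement
valid = orders.filter(F.col("amount") > 0).withColumn("amount_x2", F.col("amount") * 2)

# Wide: groupBy shuffles rows so that all records for a key end up together
per_seller = valid.groupBy("seller_id").agg(F.sum("amount").alias("total_amount"))

per_seller.explain()   # look for an Exchange node marking the shuffle
```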
7. Monitoring and Debugging PySpark Performance Issues
Answer:
Tools:
- Spark UI: View stages, tasks, and shuffles.
- Metrics: Task duration, shuffle size, and GC time.
- Logs: Check executor logs for failures.
Steps:
- Identify skewed partitions using the Spark UI.
- Use .explain() to review the physical plan (see the sketch below).
- Optimize joins and shuffles by repartitioning or broadcasting.
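A short sketch of the .explain() step plus a quick skew check; the path and grouping column are assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("debug-demo").getOrCreate()
df = spark.read.parquet("/data/orders")   # hypothetical path

# Review the physical plan: look for Exchange (shuffle) nodes and the chosen join strategy
df.groupBy("seller_id").count().explain(mode="formatted")

# Rough skew check: row counts per partition (the Spark UI shows this per task as well)
sizes = df.rdd.mapPartitions(lambda it: [sum(1 for _ in it)]).collect()
print(min(sizes), max(sizes))   # a large gap suggests skewed partitions
```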
8. Incremental Data Processing
Answer:
Use watermarking (in Structured Streaming) or a high-watermark timestamp column (in batch jobs) to process only new or updated records.
Example:
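A minimal Structured Streaming sketch; the source path, schema, and window sizes are assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("incremental-demo").getOrCreate()

# Streaming file source; path and schema are hypothetical
events = (spark.readStream
          .schema("event_id STRING, event_time TIMESTAMP, amount DOUBLE")
          .parquet("/data/incoming_events"))

# The watermark bounds how late records may arrive before they are dropped
windowed = (events
            .withWatermark("event_time", "30 minutes")
            .groupBy(F.window("event_time", "10 minutes"))
            .agg(F.sum("amount").alias("total_amount")))

query = (windowed.writeStream
         .outputMode("append")
         .format("parquet")
         .option("path", "/data/aggregated_events")
         .option("checkpointLocation", "/chk/incremental-demo")
         .start())
```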
9. PySpark Join Operation for Large DataFrames
Answer:
Choose the join strategy based on dataset size:
- Broadcast Join: When one DataFrame is small.
- Shuffle Join: Default for large DataFrames.
Optimization: Use broadcast() on the smaller DataFrame to avoid shuffling the larger one (see the sketch below).
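A minimal sketch of the broadcast hint, assuming a large orders fact table and a small sellers dimension table:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("join-demo").getOrCreate()

orders = spark.read.parquet("/data/orders")        # large fact table (hypothetical)
sellers = spark.read.parquet("/data/dim_sellers")  # small dimension table (hypothetical)

# The small table is shipped to every executor, so the large table is never shuffled
enriched = orders.join(broadcast(sellers), on="seller_id", how="left")

enriched.explain()   # expect a BroadcastHashJoin instead of a SortMergeJoin
```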
10. Optimizing Real-Time Data Streams with Spark Streaming
Answer:
Key considerations:
- Batch Interval: Choose a small interval for low latency.
- Stateful Processing: Use checkpoints to handle failures.
- Backpressure: Enable spark.streaming.backpressure.enabled to control data flow (see the sketch below).
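A hedged configuration sketch: spark.streaming.backpressure.enabled applies to the legacy DStream API, while the Kafka options shown (broker, topic, maxOffsetsPerTrigger) are illustrative Structured Streaming equivalents for rate control:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("streaming-demo")
         .config("spark.streaming.backpressure.enabled", "true")   # DStream backpressure
         .getOrCreate())

# Structured Streaming: cap the per-batch intake from Kafka instead
stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")   # hypothetical broker
          .option("subscribe", "orders")                      # hypothetical topic
          .option("maxOffsetsPerTrigger", "10000")
          .load())

query = (stream.writeStream
         .format("console")
         .option("checkpointLocation", "/chk/streaming-demo")  # checkpointing for recovery
         .trigger(processingTime="10 seconds")                 # micro-batch interval
         .start())
```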
Conclusion
To excel in a PySpark interview, focus on understanding the fundamentals of distributed data processing, optimizing transformations, and handling real-world challenges like skewed data and incremental updates. Practice these concepts with hands-on examples to gain confidence.
Good luck with your Meesho interview! 🚀