Meesho PySpark Interview Questions for Data Engineers in 2025

Preparing for a PySpark interview? Let’s tackle some commonly asked questions, with practical answers and insights to help you ace your next Data Engineering interview at Meesho or any other top-tier tech company.

1. Explain how caching and persistence work in PySpark. When would you use cache() versus persist(), and what are their performance implications?

Answer:

Caching: cache() stores data at the default storage level for faster retrieval. That default is MEMORY_ONLY for RDDs and MEMORY_AND_DISK for DataFrames. Use cache() when you need to reuse a DataFrame or RDD multiple times in a session and the default storage level is sufficient.

Example (Python):

    df.cache()
    df.count()  # An action is needed to trigger caching

Persistence: persist() lets you specify the storage level (e.g., memory, disk, or a combination). Use persist() when you need control over where the data lives, for example when memory is limited and you want partitions that do not fit in memory to fall back to disk storage.

Example (Python):

    from pyspark import StorageLevel

    df.persist(StorageLevel.MEMORY_AND_DISK)
    df.count()  # An action is needed to trigger persistence

Performance Implications: cache() is ...
Siemens Data Analyst Interview Experience (1–3 Years): A Comprehensive Breakdown

Landing a data analyst role at a reputed company like Siemens demands a solid understanding of SQL, Python, and Power BI. Here's how I tackled the questions asked during the interview, along with detailed explanations and solutions.

SQL Questions

1. Find Devices Exceeding Daily Average Energy Usage by 20% in the Last Month

The table EnergyConsumption has columns DeviceID, Timestamp, and EnergyUsed.

Solution (SQL):

    WITH DailyUsage AS (
        SELECT DeviceID,
               CAST(Timestamp AS DATE) AS UsageDate,
               AVG(EnergyUsed) AS AvgDailyUsage
        FROM EnergyConsumption
        WHERE Timestamp >= DATEADD(MONTH, -1, GETDATE())
        GROUP BY DeviceID, CAST(Timestamp AS DATE)
    ),
    ExceedingDevices AS (
        SELECT e.DeviceID, e.Timestamp, e.EnergyUsed, d.AvgDailyUsage
        FROM EnergyConsumption e
        JOIN DailyU...