Siemens Data Analyst Questions Asked in a Recent Interview

Siemens Data Analyst Interview Experience (1–3 Years): A Comprehensive Breakdown

Landing a data analyst role at a reputed company like Siemens demands a solid understanding of SQL, Python, and Power BI. Here's how I tackled the questions asked during the interview, along with detailed explanations and solutions.




SQL Questions

1. Find Devices Exceeding Daily Average Energy Usage by 20% in the Last Month

The table EnergyConsumption has columns: DeviceID, Timestamp, and EnergyUsed.
Solution:

sql

WITH DailyUsage AS (
    SELECT
        DeviceID,
        CAST(Timestamp AS DATE) AS UsageDate,
        AVG(EnergyUsed) AS AvgDailyUsage
    FROM EnergyConsumption
    WHERE Timestamp >= DATEADD(MONTH, -1, GETDATE())
    GROUP BY DeviceID, CAST(Timestamp AS DATE)
),
ExceedingDevices AS (
    SELECT
        e.DeviceID,
        e.Timestamp,
        e.EnergyUsed,
        d.AvgDailyUsage
    FROM EnergyConsumption e
    JOIN DailyUsage d
        ON e.DeviceID = d.DeviceID
        AND CAST(e.Timestamp AS DATE) = d.UsageDate
    WHERE e.EnergyUsed > 1.2 * d.AvgDailyUsage
)
SELECT DISTINCT DeviceID FROM ExceedingDevices;

Approach:

  1. Calculate the daily average energy usage for each device.
  2. Compare each device’s energy usage with 120% of its daily average.
  3. Return devices exceeding this threshold.
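The same threshold logic can be sketched in pandas with a small, hypothetical set of readings (the device IDs and values below are made up for illustration):

```python
import pandas as pd

# Hypothetical sample readings (DeviceID, Timestamp, EnergyUsed)
df = pd.DataFrame({
    "DeviceID": [1, 1, 1, 2, 2],
    "Timestamp": pd.to_datetime([
        "2023-01-01 01:00", "2023-01-01 02:00", "2023-01-01 03:00",
        "2023-01-01 01:00", "2023-01-01 02:00",
    ]),
    "EnergyUsed": [10.0, 10.0, 16.0, 5.0, 5.5],
})

# Daily average per device, broadcast back to each row via transform()
df["UsageDate"] = df["Timestamp"].dt.date
daily_avg = df.groupby(["DeviceID", "UsageDate"])["EnergyUsed"].transform("mean")

# Flag readings above 120% of that device's daily average
exceeding = df.loc[df["EnergyUsed"] > 1.2 * daily_avg, "DeviceID"].unique()
print(sorted(exceeding))  # device 1: avg 12.0, and 16.0 > 1.2 * 12.0
```

Here `transform("mean")` plays the role of the `DailyUsage` CTE: it computes the per-device, per-day average while keeping one row per reading, so the comparison stays row-level, just like the join in the SQL version.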

2. Calculate Total Operational Time and Average Output per Machine in the Last Quarter

The table Machines has columns: MachineID, StartTime, EndTime, and Output.
Solution:

sql

SELECT
    MachineID,
    SUM(DATEDIFF(MINUTE, StartTime, EndTime)) AS TotalOperationalTime,
    AVG(Output) AS AvgOutput
FROM Machines
WHERE StartTime >= DATEADD(QUARTER, -1, GETDATE())
GROUP BY MachineID;

Approach:

  • Use DATEDIFF to calculate operational time in minutes for each entry.
  • Aggregate total time and average output for the last quarter.
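The aggregation above can be mirrored in pandas; the machine runs below are invented sample data, and timestamp subtraction replaces `DATEDIFF`:

```python
import pandas as pd

# Hypothetical machine runs (StartTime, EndTime, Output)
machines = pd.DataFrame({
    "MachineID": [1, 1, 2],
    "StartTime": pd.to_datetime(["2023-01-01 08:00", "2023-01-02 08:00", "2023-01-01 09:00"]),
    "EndTime":   pd.to_datetime(["2023-01-01 09:30", "2023-01-02 09:00", "2023-01-01 10:00"]),
    "Output": [100, 120, 80],
})

# Operational minutes per run (equivalent of DATEDIFF(MINUTE, ...))
machines["Minutes"] = (machines["EndTime"] - machines["StartTime"]).dt.total_seconds() / 60

# Aggregate total time and average output per machine
summary = machines.groupby("MachineID").agg(
    TotalOperationalTime=("Minutes", "sum"),
    AvgOutput=("Output", "mean"),
)
print(summary)
```

For machine 1 this yields 90 + 60 = 150 operational minutes and an average output of 110, matching what the SQL `SUM`/`AVG` pair would return.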

3. Rank Suppliers by Rating Within Each Region

The table Suppliers contains columns: SupplierID, Region, and Rating.
Solution:

sql

SELECT
    SupplierID,
    Region,
    Rating,
    RANK() OVER (PARTITION BY Region ORDER BY Rating DESC) AS RankWithinRegion
FROM Suppliers;

Approach:

  • Use the RANK() function with PARTITION BY to rank suppliers within each region based on their rating.
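For comparison, pandas can reproduce `RANK() OVER (PARTITION BY ...)` with `groupby().rank()`; the suppliers below are hypothetical, and `method="min"` gives SQL-style `RANK()` semantics (ties share a rank and leave a gap):

```python
import pandas as pd

# Hypothetical supplier ratings; the South region has a tie
suppliers = pd.DataFrame({
    "SupplierID": [1, 2, 3, 4],
    "Region": ["North", "North", "South", "South"],
    "Rating": [4.5, 4.8, 4.2, 4.2],
})

# method="min" + ascending=False mirrors RANK() ... ORDER BY Rating DESC
suppliers["RankWithinRegion"] = (
    suppliers.groupby("Region")["Rating"]
    .rank(method="min", ascending=False)
    .astype(int)
)
print(suppliers)
```

If you wanted `DENSE_RANK()` behavior instead (no gaps after ties), you would pass `method="dense"`.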

4. Differences Between OLAP and OLTP Databases

OLAP (Online Analytical Processing):

  • Used for data analysis and reporting.
  • Example: A data warehouse storing historical sales data for analysis.

OLTP (Online Transaction Processing):

  • Used for real-time transactional operations.
  • Example: A retail system processing customer orders and payments.

5. Optimize a SQL Query with Multiple Joins and Subqueries

Steps:

  1. Indexing: Ensure appropriate indexes exist on join and filter columns.
  2. Simplify Subqueries: Replace subqueries with joins or CTEs where possible.
  3. Avoid SELECT *: Query only necessary columns.
  4. Query Execution Plan: Use the query execution plan to identify bottlenecks.
  5. Partitioning: If working with large datasets, consider table partitioning.
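Steps 1 and 4 can be demonstrated together in miniature with SQLite, whose `EXPLAIN QUERY PLAN` is a lightweight stand-in for a full execution plan (the table and index names here are invented for the demo):

```python
import sqlite3

# Toy schema: check the plan before and after indexing a filter column
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Orders (OrderID INTEGER, CustomerID INTEGER, Amount REAL)")

query = "SELECT * FROM Orders WHERE CustomerID = 42"

# Without an index, the plan reports a full table scan
before = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print("before:", before)

# After indexing the filter column, the plan switches to an index search
conn.execute("CREATE INDEX idx_orders_customer ON Orders (CustomerID)")
after = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print("after:", after)
```

The same habit applies to SQL Server: inspect the actual execution plan, find the scans on join/filter columns, and add covering indexes where they pay off.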

Python Questions

6. Simulate and Visualize Machine Efficiency

Solution:

python

import numpy as np
import matplotlib.pyplot as plt

# Simulate efficiency metrics
time = np.arange(0, 100, 1)
efficiency = np.sin(time * 0.1) + np.random.normal(0, 0.1, len(time))

# Visualization
plt.plot(time, efficiency)
plt.title("Machine Operational Efficiency Over Time")
plt.xlabel("Time")
plt.ylabel("Efficiency")
plt.show()

7. Connect to SQL Database and Save Results to CSV

Solution:

python

import pyodbc
import pandas as pd

conn = pyodbc.connect(
    'DRIVER={SQL Server};SERVER=server_name;DATABASE=db_name;UID=user;PWD=password'
)
query = "SELECT * FROM table_name"
data = pd.read_sql(query, conn)
data.to_csv("output.csv", index=False)

8. Calculate Correlation Between Two Columns

Solution:

python

import pandas as pd

def calculate_correlation(data, col1, col2):
    return data[col1].corr(data[col2])

# Example
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
correlation = calculate_correlation(df, 'A', 'B')
print("Correlation:", correlation)

9. Identify and Visualize Trends in Manufacturing Data

Solution:

python

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Sample data (np.random requires the numpy import above)
data = pd.DataFrame({
    'Date': pd.date_range(start='1/1/2023', periods=100),
    'Output': np.random.randint(100, 200, 100)
})

# Visualization
data.set_index('Date')['Output'].plot(title='Manufacturing Trends')
plt.show()

Power BI Questions

10. Design a Dashboard for Production Line Monitoring

  • Include KPIs like Total Output, Downtime, Efficiency %.
  • Use visuals such as bar charts (factory-wise output), line charts (efficiency over time), and cards for KPIs.
  • Use slicers to filter by factory, product, or date.

11. Integrate Data From Multiple Sources

  1. Use Power BI’s Get Data feature to connect to SQL Server, Excel, or APIs.
  2. Model the data using relationships.
  3. Use Power Query to clean and transform the data.

12. Direct Query: Advantages and Limitations

Advantages:

  • Real-time data updates.
  • Suitable for large datasets stored in optimized databases.

Limitations:

  • Slower report performance for complex queries.
  • Limited DAX functionality.

13. Simulate Scenarios With What-If Parameters

  • Use Power BI’s What-If Parameter feature to create variables (e.g., resource availability).
  • Adjust slicers to simulate and compare outcomes.

14. DAX Measure for Cumulative Production Output

Solution:

DAX

CumulativeProduction =
CALCULATE(
    SUM(Production[Output]),
    FILTER(
        ALL(Production[Date]),
        Production[Date] <= MAX(Production[Date])
    )
)

Closing Thoughts

Preparing for a Siemens Data Analyst interview requires a blend of SQL expertise, Python programming, and Power BI proficiency. Focus on problem-solving, optimizing queries, and presenting actionable insights to stand out.

Good luck with your preparation! 🚀
