

Wells Fargo Data Analyst Interview Questions and Answers

My Wells Fargo Data Analyst Interview Experience (1–3 Years)

CTC: 16 LPA

As a data enthusiast and SQL aficionado, I recently tackled some challenging SQL and Python questions in a Wells Fargo interview for a Data Analyst position. The experience was both rewarding and insightful. Here’s how I approached these questions.




SQL Questions

1. Identify Inactive Accounts

To identify accounts inactive for more than 12 months:

sql

SELECT AccountID, CustomerID, Balance
FROM Accounts
WHERE LastTransactionDate < DATEADD(YEAR, -1, GETDATE());

This query filters accounts where the LastTransactionDate is older than one year.
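The same filter can be sketched in pandas. This is a minimal illustration, assuming a hypothetical `accounts` DataFrame whose column names mirror the query above; the sample rows and the fixed `as_of` date (used in place of GETDATE() so the result is reproducible) are assumptions, not data from the interview.

```python
import pandas as pd

# Hypothetical sample data; column names mirror the SQL query above.
accounts = pd.DataFrame({
    "AccountID": [1, 2, 3],
    "CustomerID": [10, 11, 12],
    "Balance": [500.0, 1200.0, 75.0],
    "LastTransactionDate": pd.to_datetime(
        ["2023-01-15", "2024-11-01", "2024-02-28"]
    ),
})


def inactive_accounts(df, as_of):
    """Return accounts whose last transaction is more than one year before as_of."""
    # Equivalent of DATEADD(YEAR, -1, GETDATE()) with an explicit reference date
    cutoff = pd.Timestamp(as_of) - pd.DateOffset(years=1)
    mask = df["LastTransactionDate"] < cutoff
    return df.loc[mask, ["AccountID", "CustomerID", "Balance"]]


inactive = inactive_accounts(accounts, as_of="2025-03-01")
print(inactive)
```

Passing the reference date as a parameter, rather than reading the clock inside the function, makes the filter easy to test against known data.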


2. Top 3 Accounts by Transaction Volume Per Month

Using ROW_NUMBER() to rank accounts by total transaction volume for each month:

sql

WITH MonthlyVolume AS (
    -- Total transaction amount per account per calendar month
    SELECT
        AccountID,
        SUM(Amount) AS TotalVolume,
        MONTH(TransactionDate) AS TransactionMonth,
        YEAR(TransactionDate) AS TransactionYear
    FROM Transactions
    GROUP BY AccountID, MONTH(TransactionDate), YEAR(TransactionDate)
),
RankedAccounts AS (
    -- Number accounts within each month, highest volume first.
    -- The alias VolumeRank avoids clashing with the RANK() function name.
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY TransactionYear, TransactionMonth
            ORDER BY TotalVolume DESC
        ) AS VolumeRank
    FROM MonthlyVolume
)
SELECT AccountID, TotalVolume, TransactionMonth, TransactionYear
FROM RankedAccounts
WHERE VolumeRank <= 3;

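This query calculates monthly transaction volumes and selects the top three accounts per month. Because ROW_NUMBER() assigns distinct numbers even when totals tie, accounts with equal volume are ordered arbitrarily; use RANK() or DENSE_RANK() if tied accounts should all be kept.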
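For comparison, here is a rough pandas sketch of the same two steps: aggregate per account and month, then keep the first rows of each month after sorting by volume. The `transactions` data below is invented for illustration, and `top_accounts_per_month` is a hypothetical helper name.

```python
import pandas as pd

# Hypothetical sample data; column names mirror the SQL query above.
transactions = pd.DataFrame({
    "AccountID": ["A", "B", "C", "D", "A", "B"],
    "Amount": [100, 400, 300, 200, 50, 75],
    "TransactionDate": pd.to_datetime([
        "2025-01-05", "2025-01-10", "2025-01-12",
        "2025-01-20", "2025-01-25", "2025-02-03",
    ]),
})


def top_accounts_per_month(df, n=3):
    # Step 1: total volume per account per calendar month (the MonthlyVolume CTE)
    monthly = (
        df.assign(
            TransactionYear=df["TransactionDate"].dt.year,
            TransactionMonth=df["TransactionDate"].dt.month,
        )
        .groupby(["TransactionYear", "TransactionMonth", "AccountID"], as_index=False)["Amount"]
        .sum()
        .rename(columns={"Amount": "TotalVolume"})
    )
    # Step 2: sort by volume within each month and keep the first n rows,
    # which plays the role of ROW_NUMBER() ... <= n
    ranked = monthly.sort_values(
        ["TransactionYear", "TransactionMonth", "TotalVolume"],
        ascending=[True, True, False],
    )
    return (
        ranked.groupby(["TransactionYear", "TransactionMonth"])
        .head(n)
        .reset_index(drop=True)
    )


top3 = top_accounts_per_month(transactions)
print(top3)
```

In this sample, account A totals 150 in January and is dropped because B, C, and D have higher volumes; February has a single account, so only one row is returned for it.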


3. Average Loan Amount for Approved Applications in the Last Six Months

To calculate the average loan amount for approved applications submitted in the past six months:

sql
SELECT AVG(LoanAmount) AS AverageLoanAmount
FROM LoanApplications
WHERE ApprovalStatus = 'Approved'
  AND ApplicationDate >= DATEADD(MONTH, -6, GETDATE());

DATEADD(MONTH, -6, GETDATE()) computes the cutoff six months before the current date, so only applications submitted on or after that cutoff are averaged.


4. Clustered vs. Non-Clustered Index

A clustered index determines the physical order of data in a table, meaning there can only be one per table. For instance, a clustered index on CustomerID organizes rows by CustomerID.
A non-clustered index, however, is a separate structure that contains pointers to the table's data, so a table can have several of them. Use a clustered index for primary keys and non-clustered indexes for columns that are frequently used in filters or joins.
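The difference can be observed with a small experiment. The sketch below uses Python's built-in sqlite3 module as a stand-in, which is an assumption made for portability: SQLite does not use the clustered/non-clustered terminology, but a table with an INTEGER PRIMARY KEY is stored in key order (similar in spirit to a clustered index), while CREATE INDEX builds a separate structure that points back to the rows. The table, index name, and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Rows are stored in CustomerID order (analogous to a clustered index)
conn.execute(
    "CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, Name TEXT, City TEXT)"
)
# A separate lookup structure on City (analogous to a non-clustered index)
conn.execute("CREATE INDEX idx_customers_city ON Customers (City)")
conn.executemany(
    "INSERT INTO Customers VALUES (?, ?, ?)",
    [(1, "Asha", "Pune"), (2, "Ravi", "Delhi"), (3, "Meera", "Pune")],
)


def query_plan(sql):
    """Return the human-readable plan SQLite chooses for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The fourth column of each plan row holds the description text
    return " | ".join(row[3] for row in rows)


pk_plan = query_plan("SELECT * FROM Customers WHERE CustomerID = 2")
city_plan = query_plan("SELECT * FROM Customers WHERE City = 'Pune'")
print(pk_plan)
print(city_plan)
```

The first plan should mention a search on the primary key, and the second should mention `idx_customers_city`; the exact wording of the plan text varies between SQLite versions.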


5. Self-Join Scenario

A self-join is useful for comparing rows within the same table. For example, finding employees earning more than their managers:

sql
SELECT e1.EmployeeID, e1.Salary
FROM Employees e1
JOIN Employees e2
    ON e1.ManagerID = e2.EmployeeID  -- e2 is the manager's row
WHERE e1.Salary > e2.Salary;

This approach compares an employee’s salary to their manager’s.
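A pandas version of the same idea merges the table with a renamed copy of itself. This is only a sketch on made-up data: the `employees` rows and the `ManagerSalary` column name are assumptions introduced for illustration.

```python
import pandas as pd

# Hypothetical sample data; employee 1 has no manager.
employees = pd.DataFrame({
    "EmployeeID": [1, 2, 3, 4],
    "ManagerID": [None, 1, 1, 2],
    "Salary": [90000, 95000, 70000, 60000],
})

# Employees without a manager cannot match in an inner join, so drop them
# first; this also lets ManagerID be stored as an integer again.
reports = employees.dropna(subset=["ManagerID"]).astype({"ManagerID": "int64"})

# The "second copy" of the table, keyed by the manager's ID
managers = employees[["EmployeeID", "Salary"]].rename(
    columns={"EmployeeID": "ManagerID", "Salary": "ManagerSalary"}
)

joined = reports.merge(managers, on="ManagerID", how="inner")
result = joined.loc[joined["Salary"] > joined["ManagerSalary"], ["EmployeeID", "Salary"]]
print(result)
```

Here only employee 2 (salary 95000) earns more than their manager, employee 1 (salary 90000).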


Python Questions

6. Convert JSON to a DataFrame

To process a JSON file into a structured DataFrame:

python
import json

import pandas as pd


def json_to_dataframe(file_path):
    # Load the JSON file and build a DataFrame from its records
    with open(file_path, 'r') as file:
        data = json.load(file)
    df = pd.DataFrame(data)
    return df


# Example usage:
df = json_to_dataframe('customers.json')
print(df.head())
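Passing the parsed data straight to pd.DataFrame works well when the file holds a list of flat records. If the records contain nested objects, pandas' `json_normalize` can flatten them into dotted column names. The sketch below parses an in-memory string instead of a file so it runs on its own; the record fields are hypothetical.

```python
import json

import pandas as pd

# Hypothetical JSON with a nested "address" object in each record
raw = """
[
    {"id": 1, "name": "Asha", "address": {"city": "Pune"}},
    {"id": 2, "name": "Ravi", "address": {"city": "Delhi"}}
]
"""

records = json.loads(raw)

# Nested keys become columns such as "address.city"
flat_df = pd.json_normalize(records)
print(flat_df)
```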

7. Calculate Moving Average

To compute a moving average for a numerical column:

python
def moving_average(data, column, window_size):
    # Mean of the current row and the (window_size - 1) rows before it
    data[f'{column}_MovingAvg'] = data[column].rolling(window=window_size).mean()
    return data


# Example usage:
df = moving_average(df, 'Sales', 3)
print(df.head())
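A small worked example makes the window behaviour concrete. With a window of 3, the first two rows have no full window and come out as NaN; passing `min_periods=1` instead averages whatever rows are available. The `sales` values below are made up for illustration.

```python
import pandas as pd

sales = pd.DataFrame({"Sales": [10, 20, 30, 40, 50]})

# Full windows only: rows 0 and 1 are NaN, row 2 is (10 + 20 + 30) / 3 = 20
sales["Sales_MovingAvg"] = sales["Sales"].rolling(window=3).mean()

# Partial windows allowed: row 0 is 10, row 1 is (10 + 20) / 2 = 15
sales["Sales_MovingAvg_Partial"] = (
    sales["Sales"].rolling(window=3, min_periods=1).mean()
)

print(sales)
```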

8. Data Validation and Cleaning

Python libraries like pandas simplify validation and cleaning:

python
import pandas as pd


def clean_data(df):
    df = df.drop_duplicates()  # Remove duplicate rows
    # Forward-fill nulls; fillna(method='ffill') is deprecated in recent pandas
    df = df.ffill()
    return df


# Example usage:
cleaned_df = clean_data(raw_df)
print(cleaned_df.info())
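Before cleaning, it often helps to measure what needs fixing. The following is a minimal validation sketch, assuming a hypothetical `raw_df` and a hypothetical helper named `validation_report`; it only counts rows, duplicate rows, and nulls per column.

```python
import pandas as pd

# Hypothetical input with one duplicate row and one missing value
raw_df = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3],
    "Amount": [100.0, 100.0, None, 250.0],
})


def validation_report(df):
    """Summarize basic data-quality issues without modifying the DataFrame."""
    return {
        "rows": len(df),
        "duplicate_rows": int(df.duplicated().sum()),
        "nulls_per_column": df.isna().sum().to_dict(),
    }


report = validation_report(raw_df)
print(report)
```

Running the report before and after cleaning gives a quick check that duplicates and nulls were actually handled.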

9. Detect Outliers Using IQR

To identify outliers using the IQR method:

python
def detect_outliers_iqr(data, column):
    # Quartiles and interquartile range of the column
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    # Values beyond 1.5 * IQR from the quartiles are treated as outliers
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return data[(data[column] < lower_bound) | (data[column] > upper_bound)]


# Example usage:
outliers = detect_outliers_iqr(df, 'Amount')
print(outliers)
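To see the bounds in action, here is a worked example on a tiny made-up column. The values and the helper name `iqr_bounds` are assumptions for illustration; the arithmetic follows the same 1.5 * IQR rule as the function above.

```python
import pandas as pd


def iqr_bounds(series):
    """Return the (lower, upper) outlier bounds using the 1.5 * IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


df = pd.DataFrame({"Amount": [10, 12, 11, 13, 12, 11, 100]})

# Q1 = 11.0 and Q3 = 12.5, so IQR = 1.5 and the bounds are 8.75 and 14.75
lower, upper = iqr_bounds(df["Amount"])
outliers = df[(df["Amount"] < lower) | (df["Amount"] > upper)]

print(lower, upper)
print(outliers)
```

Only the value 100 falls outside the bounds, so it is the single row flagged as an outlier.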

Reflections and Key Takeaways

This interview experience highlighted the importance of solid SQL skills and Python proficiency in real-world data analysis. The key is to:

  1. Understand the Problem: Break it down into smaller steps.
  2. Optimize Solutions: Ensure queries and scripts are efficient.
  3. Demonstrate Versatility: Use various SQL functions and Python libraries effectively.

This hands-on challenge boosted my confidence and clarified how to apply theoretical knowledge to solve practical problems.
