Flipkart Business Analyst Interview Experience (1-3 Years)
Recently, I appeared for an interview at Flipkart for the position of Business Analyst, and I’m excited to share the questions asked during the process along with how I would approach answering them. The interview covered various domains such as SQL, guesstimates, case studies, managerial scenarios, and Python.
Here’s how I would have tackled each question:
SQL Questions
1️⃣ What are window functions, and how do they differ from aggregate functions? Can you give a use case?
- Answer:
Window functions perform calculations across a set of table rows related to the current row and return a value for every row, whereas aggregate functions used with GROUP BY collapse each group into a single row.
- Example:
Use case: Finding the latest order per customer without grouping data.
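Here is a minimal sketch of that use case, run through Python's built-in `sqlite3` module so it can be tried locally. The `orders` table, its columns, and the sample rows are assumptions made up for illustration, and window functions need SQLite 3.25 or newer.

```python
import sqlite3

# Hypothetical orders table; the schema and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, order_date TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, 101, "2025-01-05", 250.0),
        (2, 101, "2025-02-10", 400.0),
        (3, 102, "2025-01-20", 150.0),
        (4, 102, "2025-01-25", 300.0),
    ],
)

# ROW_NUMBER() ranks each customer's orders from newest to oldest.
# Every order row survives the inner query; the outer filter keeps rank 1.
latest_order_sql = """
    SELECT customer_id, order_id, order_date, amount
    FROM (
        SELECT o.*,
               ROW_NUMBER() OVER (
                   PARTITION BY customer_id
                   ORDER BY order_date DESC
               ) AS rn
        FROM orders AS o
    ) AS ranked
    WHERE rn = 1
    ORDER BY customer_id
"""
latest_orders = conn.execute(latest_order_sql).fetchall()
print(latest_orders)
```

A plain `GROUP BY customer_id` with `MAX(order_date)` would give the latest date but not the rest of that order's columns, which is why the window function is handy here.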
2️⃣ Explain indexing. When could an index potentially reduce performance, and how would you approach indexing for a large dataset?
- Answer:
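Indexing speeds up reads by maintaining a separate lookup structure (typically a B-tree) on one or more columns, so the database can find matching rows without scanning the whole table. However, it can reduce performance during write operations (INSERT, UPDATE, DELETE), because every index on the table has to be updated along with the data.
- For large datasets, I’d: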
- Create indexes only on frequently queried columns.
- Use covering indexes where possible.
- Avoid excessive indexing, as it can increase storage costs.
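To make the covering-index point concrete, here is a small sketch using Python's built-in `sqlite3` module. The `orders` table and the index name `idx_orders_customer_date` are hypothetical, and the exact wording of the query plan differs between SQLite versions, so the code only checks for the phrase "COVERING INDEX".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, customer_id INTEGER, order_date TEXT, amount REAL)"
)

# Composite index on the filter column plus the only other column the query reads.
conn.execute(
    "CREATE INDEX idx_orders_customer_date ON orders (customer_id, order_date)"
)

# The query touches only indexed columns, so it can be answered from the index alone.
plan_rows = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT customer_id, order_date FROM orders WHERE customer_id = ?",
    (101,),
).fetchall()

# The human-readable plan step is the last column of each row.
plan_text = " ".join(row[-1] for row in plan_rows)
uses_covering_index = "COVERING INDEX" in plan_text
print(plan_text)
```

If the query also selected `amount`, the index would no longer cover it and the engine would have to visit the table for each match. That is the trade-off to weigh before widening an index.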
3️⃣ Write a query to retrieve customers who made purchases in the last 30 days but didn’t purchase anything in the previous 30 days.
- Answer:
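One way I might write this, sketched with Python's built-in `sqlite3` module. The `purchases` table, its columns, and the fixed reference date are assumptions for illustration; in a real warehouse the reference date would normally be the current date and the date functions would follow that engine's syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer_id INTEGER, purchase_date TEXT)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [
        (1, "2025-06-20"),  # recent window only: returned
        (2, "2025-06-25"),  # recent window ...
        (2, "2025-05-15"),  # ... and previous window: excluded
        (3, "2025-05-10"),  # previous window only: excluded
        (4, "2025-06-28"),  # recent window ...
        (4, "2025-03-01"),  # ... plus an older purchase outside both windows: returned
    ],
)

# Recent window:   (today - 30 days, today]
# Previous window: (today - 60 days, today - 30 days]
recent_not_previous_sql = """
    SELECT DISTINCT p.customer_id
    FROM purchases AS p
    WHERE p.purchase_date >  date(:today, '-30 days')
      AND p.purchase_date <= :today
      AND NOT EXISTS (
          SELECT 1
          FROM purchases AS prev
          WHERE prev.customer_id = p.customer_id
            AND prev.purchase_date >  date(:today, '-60 days')
            AND prev.purchase_date <= date(:today, '-30 days')
      )
    ORDER BY p.customer_id
"""
rows = conn.execute(recent_not_previous_sql, {"today": "2025-06-30"}).fetchall()
customer_ids = [row[0] for row in rows]
print(customer_ids)
```

In the interview I would state the window boundaries out loud, since whether "last 30 days" includes today is exactly the kind of detail the interviewer may probe.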
4️⃣ Given a table of transactions, find the top 3 most purchased products for each category.
- Answer:
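A possible approach is to total the units per product first and then rank within each category. The sketch below runs through Python's built-in `sqlite3` module; the `transactions` schema and rows are made up, and I am assuming "most purchased" means highest total quantity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (category TEXT, product TEXT, quantity INTEGER)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [
        ("Electronics", "phone", 5),
        ("Electronics", "phone", 3),
        ("Electronics", "laptop", 4),
        ("Electronics", "earbuds", 6),
        ("Electronics", "charger", 2),
        ("Fashion", "shirt", 7),
        ("Fashion", "jeans", 3),
    ],
)

top3_sql = """
    WITH product_totals AS (
        SELECT category, product, SUM(quantity) AS units
        FROM transactions
        GROUP BY category, product
    ),
    ranked AS (
        SELECT category, product, units,
               ROW_NUMBER() OVER (
                   PARTITION BY category
                   ORDER BY units DESC, product
               ) AS rn
        FROM product_totals
    )
    SELECT category, product, units
    FROM ranked
    WHERE rn <= 3
    ORDER BY category, rn
"""
top_products = conn.execute(top3_sql).fetchall()
print(top_products)
```

`ROW_NUMBER()` returns exactly three rows per category and breaks ties alphabetically here. If tied products should all be kept, `DENSE_RANK()` is the usual swap.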
5️⃣ How would you identify duplicate records in a large dataset and remove only the duplicates, retaining the first occurrence?
- Answer:
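A sketch of one way to do this, again via Python's built-in `sqlite3` module. The `customers_raw` table is hypothetical, and I am assuming that a duplicate means the same `email` and `name`, and that the first occurrence is the row with the lowest `id`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers_raw (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO customers_raw VALUES (?, ?, ?)",
    [
        (1, "a@x.com", "Asha"),
        (2, "b@x.com", "Bala"),
        (3, "a@x.com", "Asha"),  # duplicate of id 1
        (4, "a@x.com", "Asha"),  # duplicate of id 1
        (5, "c@x.com", "Chitra"),
    ],
)

# Step 1: identify duplicate groups and how many copies each has.
duplicate_groups = conn.execute("""
    SELECT email, name, COUNT(*) AS copies
    FROM customers_raw
    GROUP BY email, name
    HAVING COUNT(*) > 1
""").fetchall()

# Step 2: delete every row that is not the lowest id within its group.
conn.execute("""
    DELETE FROM customers_raw
    WHERE id NOT IN (
        SELECT MIN(id)
        FROM customers_raw
        GROUP BY email, name
    )
""")
remaining_ids = [row[0] for row in conn.execute("SELECT id FROM customers_raw ORDER BY id")]
print(duplicate_groups, remaining_ids)
```

On a very large table I would consider writing the surviving rows to a new table instead of deleting in place, but the grouping logic stays the same.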
Guesstimates
1️⃣ Estimate the number of online food delivery orders in a large metropolitan city over a month.
- Answer:
- Population: 10 million
- Percentage ordering food online: 30%
- Average orders per person per month: 5
- Total orders = 10M * 30% * 5 = 15 million orders/month
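The same top-down arithmetic can be wrapped in a tiny helper so the assumptions are easy to vary. The function name and the low/high scenario values are my own illustrative assumptions; only the base case comes from the numbers above.

```python
def estimate_monthly_orders(population, share_ordering_online, orders_per_user):
    """Top-down estimate: population x share of online orderers x orders per user."""
    return round(population * share_ordering_online * orders_per_user)


# Base case from the answer above.
base_case = estimate_monthly_orders(10_000_000, 0.30, 5)

# Hypothetical low and high scenarios, to show how sensitive the result is.
low_case = estimate_monthly_orders(10_000_000, 0.20, 4)
high_case = estimate_monthly_orders(10_000_000, 0.40, 6)

print(f"low={low_case:,} base={base_case:,} high={high_case:,}")
```

Giving a range instead of a single number shows the interviewer which assumption drives the estimate the most.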
2️⃣ How many customer service calls would a telecom company receive daily for a customer base of 1 million?
- Answer:
- Assume 5% of users call customer service daily.
- Calls = 1M * 5% = 50,000 calls/day
Case Studies
1️⃣ A sudden decrease in conversion rate is observed in a popular product category. How would you investigate the cause and propose solutions?
- Answer:
- Investigate:
- Analyze traffic trends (source, location).
- Check product availability and pricing.
- Review customer feedback.
- Propose solutions:
- Optimize pricing strategy.
- Resolve technical issues on the website.
- Enhance product descriptions or images.
2️⃣ The company is considering adding a new subscription model. How would you evaluate its potential impact on customer lifetime value and revenue?
- Answer:
- Analyze historical data to determine potential upsell opportunities.
- Conduct surveys to gauge customer interest.
- Simulate subscription revenue based on adoption rates and churn predictions.
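The simulation step could start as simply as the sketch below. Every number in it (customer base, adoption rate, price, churn) is a hypothetical placeholder, and it assumes a constant monthly churn rate with no new sign-ups after launch, which a real model would relax.

```python
def simulate_subscription_revenue(customers, adoption_rate, monthly_price,
                                  monthly_churn, months):
    """Return a list of monthly subscription revenue under constant churn."""
    subscribers = customers * adoption_rate
    revenue_by_month = []
    for _ in range(months):
        revenue_by_month.append(subscribers * monthly_price)
        # A fixed share of subscribers cancels before the next billing cycle.
        subscribers *= (1 - monthly_churn)
    return revenue_by_month


def expected_subscriber_value(monthly_price, monthly_churn):
    """With constant churn, expected lifetime is 1 / churn months."""
    return monthly_price / monthly_churn


# Hypothetical inputs: 100k customers, 10% adoption, price 200, 5% monthly churn.
revenue = simulate_subscription_revenue(100_000, 0.10, 200, 0.05, 12)
first_year_revenue = sum(revenue)
value_per_subscriber = expected_subscriber_value(200, 0.05)

print(round(first_year_revenue), round(value_per_subscriber))
```

Running it across a few adoption and churn scenarios gives a revenue range to compare against the current lifetime value of the same customers.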
Managerial Questions
1️⃣ Describe a time when you faced conflicting priorities on a project. How did you manage your workload to meet deadlines?
- Answer:
- I prioritize tasks by impact and urgency using the Eisenhower Matrix, then communicate expectations and negotiate timelines with stakeholders. This approach has helped me deliver results on time when priorities conflict.
2️⃣ How would you handle a disagreement within the team on an analytical approach?
- Answer:
- Facilitate a discussion to ensure everyone’s perspective is heard. Use data to support decision-making and align the team towards a common goal.
Python Questions
1️⃣ Write a Python function to find the longest consecutive sequence of unique numbers in a list.
- Answer:
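The question can be read in more than one way. I am assuming it asks for the longest contiguous stretch of the list in which no number repeats, and I would confirm that with the interviewer first. Under that assumption, a sliding-window sketch looks like this (the function name is my own):

```python
def longest_unique_run(numbers):
    """Return the longest contiguous slice of `numbers` with no repeated value."""
    last_seen = {}            # value -> index of its most recent occurrence
    window_start = 0          # left edge of the current repeat-free window
    best_start, best_len = 0, 0

    for i, value in enumerate(numbers):
        # If the value already appears inside the window, move the left edge past it.
        if value in last_seen and last_seen[value] >= window_start:
            window_start = last_seen[value] + 1
        last_seen[value] = i

        current_len = i - window_start + 1
        if current_len > best_len:
            best_start, best_len = window_start, current_len

    return numbers[best_start:best_start + best_len]


print(longest_unique_run([1, 2, 3, 1, 2, 3, 4, 5, 3]))
```

Each element is visited once, so the function runs in O(n) time with O(n) extra space for the dictionary. If the interviewer instead means consecutive integers such as 4, 5, 6, the usual answer is a set-based scan rather than a sliding window.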
2️⃣ If you’re working with a large dataset with missing values, what Python libraries would you use to handle missing data, and why?
- Answer:
- Libraries:
- Pandas: To detect and fill/drop missing values.
- Scikit-learn: For advanced imputation methods like KNNImputer.
- I’d choose based on the dataset and the need for accuracy vs. simplicity.
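A short sketch of both options on a toy DataFrame. The column names and values are invented for illustration, and it assumes `pandas`, `numpy`, and `scikit-learn` are installed.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy data with one missing value per column (values are made up).
df = pd.DataFrame({
    "age": [25.0, np.nan, 35.0, 40.0],
    "income": [30.0, 42.0, np.nan, 60.0],
})

# Detect: count missing values per column.
missing_per_column = df.isna().sum()

# Simple option: fill each column with its own median.
median_filled = df.fillna(df.median(numeric_only=True))

# More advanced option: estimate each gap from the 2 most similar rows.
knn_values = KNNImputer(n_neighbors=2).fit_transform(df)
knn_filled = pd.DataFrame(knn_values, columns=df.columns)

print(missing_per_column.to_dict())
print(median_filled)
print(knn_filled)
```

Median filling is fast and easy to explain, while KNN imputation can preserve relationships between columns at the cost of more computation on large datasets.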
Pro Tip
- Always structure your answers logically, especially for guesstimates and case studies.
- Highlight problem-solving skills and focus on clarity.
Follow for more interview experiences and actionable tips!
Hashtag: #Flipkart #BusinessAnalyst #InterviewExperience #DataAnalysis