Dask-Based Aggregation of CSV Data: Mean and Median Calculation


Code introduction


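This function reads a CSV file from the given path with Dask and returns the mean and median of its 'value' column as a single-row pandas DataFrame. Dask splits the file into partitions and processes them in parallel, which helps when the data is too large to load into memory at once.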
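
To illustrate what partitioned processing means for these two statistics, here is a minimal standard-library sketch (it does not use Dask, and the partition contents are made up). A mean can be combined exactly from per-partition sums and counts, while a median cannot be rebuilt from per-partition medians, which is why partitioned tools often fall back on approximate medians.

```python
from statistics import mean, median

# Hypothetical data split into three partitions, as a partitioned reader might do
partitions = [[1, 2, 3], [4, 5, 6, 7], [100]]
all_values = [v for part in partitions for v in part]

# Mean: each partition contributes a (sum, count) pair, and the pairs combine exactly
partials = [(sum(part), len(part)) for part in partitions]
total, count = map(sum, zip(*partials))
combined_mean = total / count

# Median: taking the median of per-partition medians does not give the global median
median_of_medians = median(median(part) for part in partitions)
true_median = median(all_values)

print("combined mean:", combined_mean, "direct mean:", mean(all_values))
print("median of medians:", median_of_medians, "true median:", true_median)
```

The two means agree, while the two medians differ, so an exact median needs either a global sort or an approximation scheme.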


Technology Stack : Dask, NumPy, Pandas

Code Type : Function

Code Difficulty : Intermediate


                
                    
import dask
import dask.dataframe as dd
import pandas as pd
from dask.distributed import Client

def aggregate_data(file_path):
    # Start a local Dask cluster; the context manager shuts it down on exit
    with Client():
        # Lazily read the CSV file into a Dask DataFrame
        df = dd.read_csv(file_path)

        # Build lazy expressions for the mean and median of the 'value' column.
        # Dask does not compute an exact median across partitions, so
        # quantile(0.5) is used here; its result is an approximation.
        mean_expr = df['value'].mean()
        median_expr = df['value'].quantile(0.5)

        # Evaluate both expressions together so the file is scanned only once
        mean_value, median_value = dask.compute(mean_expr, median_expr)

    # Return the results as a single-row pandas DataFrame
    return pd.DataFrame({
        'Mean': [mean_value],
        'Median': [median_value]
    })
              