Optimizing large Excel files with Pandas & Dask
Excel is still a "go-to" choice for tabular calculations on small, medium, and even large datasets. However, it becomes tedious once a dataset grows too large – say a million rows and hundreds of columns – as Excel tends to slow down, hang, or crash. Handling datasets of that size is not easy for Excel alone, and that's where Python comes into the picture. Python helps users overcome Excel's shortcomings and work with far more data than Excel can comfortably handle.
While working with a client, we needed to optimize Excel file operations to accommodate large amounts of data. Their existing dashboard worked well, but its sheer size caused delays and lag for end users: uploading files or doing simple tasks like searching for keywords would take one to two minutes. Our solution was to streamline their file operations using Python. There are many Python libraries for working with Excel – for example openpyxl, xlsxwriter, pyxlsb, and xlrd, to name a few. We chose Pandas (the Python Data Analysis Library) and Dask because they were the most suitable for this case.
Here is what I am planning to cover in this article:
- File read operations using Pandas & Dask
- Keyword searching operations using Pandas & Dask
- My observations about Pandas & Dask
Let’s discuss them in detail.
File read operations
In this case, the file needs to be read before the data can be processed, so optimizing file read operations saves a significant amount of time. Users can read file contents with Python's built-in functions and then pass the data to libraries like Pandas and Dask, but that takes far longer than using those libraries' built-in readers.
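To illustrate the difference, here is a minimal sketch ('data.csv' is a placeholder file name, and exact timings depend on your data) comparing pandas' built-in reader against reading the file manually and handing the parsed rows to pandas:

import time
import pandas as pd

# built-in reader: parsing happens in pandas' optimized engine
start = time.time()
df = pd.read_csv('data.csv')
print(f"pd.read_csv: {time.time() - start:.2f}s")

# manual read: load the text in Python, split it naively (ignores quoting),
# then build the DataFrame from the parsed rows
start = time.time()
with open('data.csv') as f:
    rows = [line.rstrip('\n').split(',') for line in f]
df2 = pd.DataFrame(rows[1:], columns=rows[0])
print(f"manual parse: {time.time() - start:.2f}s")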
The screenshots below display the time taken to read Excel/CSV files using Pandas and Dask libraries.
Fig: Pandas reading Excel and CSV files timing
Fig: Dask reading Excel and CSV files timing
Both libraries take a similar amount of time to read Excel files, while Dask is much faster than Pandas with CSV files. Note that Dask has no native Excel reader, so reading Excel files in Dask still has to go through a Pandas DataFrame.
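The usual workaround for Excel files is to wrap pandas' reader in dask.delayed and build a Dask DataFrame from it. A minimal sketch (the file names are placeholders):

import pandas as pd
import dask.dataframe as dd
from dask.delayed import delayed

# CSV: Dask has a native, parallel reader
csv_df = dd.read_csv('data.csv')

# Excel: no native Dask reader, so delegate to pandas via dask.delayed
excel_part = delayed(pd.read_excel)('data.xlsx')  # needs openpyxl for .xlsx
excel_df = dd.from_delayed(excel_part)
print(len(excel_df))  # forces the delayed read to actually run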
Reading a CSV file takes less time than reading the equivalent XLS file – in our tests it saved 10-15 seconds without affecting or modifying the data types. This can be used to optimize computations, for example by creating a background job that converts an Excel file to CSV as soon as the user uploads it.
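As a rough sketch of that background job (the function and file names here are illustrative, not taken from the client project), the conversion itself is straightforward with pandas:

import pandas as pd

def convert_upload_to_csv(xlsx_path: str, csv_path: str) -> str:
    """Convert an uploaded Excel file to CSV so later reads are faster."""
    df = pd.read_excel(xlsx_path)   # requires openpyxl for .xlsx files
    df.to_csv(csv_path, index=False)
    return csv_path

# e.g. enqueue convert_upload_to_csv('upload.xlsx', 'upload.csv')
# in a task queue right after the upload completes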
Keyword searching
Let's look at the demo code used for testing. The file is read into a global variable so that the DataFrame can be shared with the multiprocessing workers.
Pandas:
import pandas as pd
from pandasql import sqldf
import multiprocessing
import time

print("starting...")
fstart = time.time()
df = pd.read_csv('testfile22.csv')
print(f"File loading time: {str(time.time() - fstart)}")

columns = df.columns
labels = ['row_num', 'Date', 'Headline', 'URL', 'Hit Sentence', 'Influencer']
# normalize the search column: cast to string first, then lowercase
df['Hit Sentence'] = df['Hit Sentence'].astype(str).str.lower()
df['row_num'] = range(1, len(df) + 1)
df = df.astype(str)


def find_word(word, selected_col='Hit Sentence'):
    # pandasql runs a SQLite query against the in-memory DataFrame
    fwdf = sqldf(f'''SELECT row_num, Date, Headline, URL, Influencer, "{selected_col}"
                     FROM df WHERE "{selected_col}" LIKE "% {word} %"''')
    # count how many times the keyword occurs in each matching row
    fwdf['word_count'] = fwdf[selected_col].str.count(word)
    return fwdf


if __name__ == '__main__':
    search_words = [
        'career courses', 'continue education', 'continued education', 'continues education',
        'course', 'courses', 'coursework', 'educational program', 'Google scholarship',
        'grant', 'grants', 'hybrid learning', 'in-house education', 'in-person programming',
        'job certification', 'job certification program', 'job certification programs',
        'job program', 'job programs', 'lifelong education', 'lifelong learning',
        'ongoing education', 'online program', 'online seminar', 'online teaching',
        'orientation', 'orientations', 'Pell grant', 'Pell grants', 'scholarship',
        'self-help', 'self-help tools', 'skills course', 'work grant', 'work grants',
        'advice', 'apprentice', 'apprenticeship', 'apprenticeships ', 'apprenticeship program',
        'coach', 'coached', 'coaches', 'coaching', 'counsel ', 'counseling', 'remote coaching',
        'feedback', 'gain experience', 'guide', 'guidance', 'guiding', 'instruct', 'instruction',
        'invest', 'invested', 'investing', 'investment', 'investments', 'invest in',
        'co-invest', 'co-invested', 'co-investing', 'mentor', 'mentors ', 'mentoring',
        'mentorship ', 'assure', 'assured', 'assurances', 'balance', 'balanced ', 'balancing ',
        'before personal needs', 'both mom and teaching assistant',
        'come in for only part of the week', 'come in for part of the week', 'comfortable',
        'comfort of my home', 'complacent', 'fewer hours', 'harmony', 'harmonious', 'hobby',
        'hobbies', 'juggle', 'juggling', 'lifestyle', 'manage', 'managing',
        'more time for hobbies', 'nutrition', 'personal life', 'personal time',
        'personal priorities', 'quality of life', 'reduce hours', 'reduced hours',
        'reducing hours', 'shorter hours'
    ]
    # one worker per CPU core; each worker re-imports this module and
    # therefore loads its own copy of the global DataFrame
    pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())
    start_cal = time.time()
    res = pool.map(find_word, search_words)
    print(f"Word calculation took: {str(time.time() - start_cal)}")
    pool.close()
    print("finished...")
Dask:
# pip install "dask[dataframe]"
# pip install dask-sql
import time
import json
import dask.dataframe as dd
from distributed import Client
from dask.delayed import delayed
import pandas as pd

print("starting....")
fstart = time.time()
# wrap the pandas reader in dask.delayed and build a Dask DataFrame from it
parts = delayed(pd.read_csv)("all.csv")
df = dd.from_delayed(parts)
print(len(df))
print(f"File loading time: {str(time.time() - fstart)}")
df.head()  # check the first rows
df = df.astype(str)
columns = df.columns
result_dict = {}


def find_word(word, selected_col='Hit Sentence'):
    print(word)
    # match the keyword surrounded by spaces, or followed by a hyphen
    fwdf = df[df[selected_col].str.contains(f" {word} | {word}-")]
    fwdf['word_count'] = fwdf[selected_col].str.count(f" {word} ")
    fwdf['word_count2'] = fwdf[selected_col].str.count(f" {word}-")
    fwdf['word_count_total'] = fwdf['word_count'] + fwdf['word_count2']
    labels = ['Date', 'Headline', 'URL', 'Hit Sentence', 'Influencer',
              'word_count', 'word_count2', 'word_count_total']
    return word, fwdf[labels]


if __name__ == '__main__':
    client = Client()
    search_words = [
        "air", "stop", "good", "job", "hospital", "covid", "career courses",
        "continue education", "continued education", "continues education", "course",
        "courses", "coursework", "educational program", "Google scholarship", "grant",
        "grants", "hybrid learning", "job certification", "job certification program",
        "job certification programs", "job program", "job programs", "lifelong education",
        "lifelong learning", "ongoing education", "online program", "online seminar",
        "online teaching", "orientation", "orientations", "Pell grant", "Pell grants",
        "scholarship", "skills course", "work grant", "work grants", "advice", "apprentice",
        "apprenticeship", "apprenticeships ", "apprenticeship program", "coach", "coached",
        "coaches", "coaching", "counsel ", "counseling", "remote coaching", "feedback",
        "gain experience", "guide", "guidance", "guiding", "instruct", "instruction",
        "invest", "invested", "investing", "investment", "investments", "invest in",
        "mentor", "mentors ", "mentoring", "mentorship ", "assure", "assured", "assurances",
        "balance", "balanced ", "balancing ", "before personal needs",
        "both mom and teaching assistant", "come in for only part of the week",
        "come in for part of the week", "comfortable", "comfort of my home", "complacent",
        "fewer hours", "harmony", "harmonious", "hobby", "hobbies", "juggle", "juggling",
        "lifestyle", "manage", "managing", "more time for hobbies", "nutrition",
        "personal life", "personal time", "personal priorities", "quality of life",
        "reduce hours", "reduced hours", "reducing hours", "shorter hours",
        # "co-invest", "co-invested", "co-investing",
        "self-help", "self-help tools", "in-house education", "in-person programming"
    ]
    # the hard-coded list above is replaced by keywords loaded from a file
    file_data = json.loads(open('allwords.json', 'r').read())
    search_words = file_data['keywords']
    selected_col = 'Hit Sentence'
    print(f"search keywords len: {str(len(search_words))}")
    cstart = time.time()
    # submit one task per keyword to the Dask cluster
    x = client.map(find_word, search_words)
    res_dict = {}
    for val in x:
        w, rdf = val.result()
        res_dict[w] = rdf
    print(f"Word calculation time: {str(time.time() - cstart)}")
    print(len(res_dict))
    print("end....")
To search for keywords, the user provides the column name to search in and a list of keywords, which may number 100, 500, 1,000, 2,500, and so on. For example, suppose the user wants to search for 1,000 keywords in the selected column and return the result to the dashboard. The result contains the selected columns, the rows containing the keywords, and an extra column with the number of keyword occurrences per row. The computation below was done on a machine with 4 CPUs and 16 GB of RAM.
Fig: Using Pandas to find 996 keywords on 40k rows
Fig: Using Pandas to find 2644 keywords on 40k rows
Fig: Using Dask to find 996 keywords on 40k rows
Fig: Using Dask to find 2644 keywords on 40k rows
Observations
The following observations were made on a machine with 8 CPUs and 16 GB of RAM.
Fig: Pandas 996 Keywords with 40k rows
Fig: Pandas 2644 keywords with 40k rows
Fig: Dask 996 Keywords with 40k rows
Fig: Dask 2644 Keywords with 40k rows
After trying combinations of Pandas and Dask on files of different sizes and row counts, we found that Pandas with multiprocessing is a great fit for files below 500 MB or 1 million rows, since it operates on a single in-memory DataFrame.
Use Dask only if the file size is more than 1 GB and the data has more than 1 million rows, because Dask workers take at least 15-20 seconds just to copy the data from the main program to the workers before they can run any compute logic on it. In short, Dask works very well in the case of extensive data.
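One way to reduce that per-task copy cost is to scatter the DataFrame to all workers once with client.scatter(..., broadcast=True) and pass the resulting future into each task. A sketch under our assumptions (simplified find_word; not part of the original demo):

import pandas as pd
from distributed import Client

def find_word(word, df, selected_col='Hit Sentence'):
    # df is the worker-local copy of the scattered DataFrame
    hits = df[df[selected_col].str.contains(f" {word} ", regex=False)]
    return word, len(hits)

if __name__ == '__main__':
    client = Client()
    df = pd.read_csv('all.csv').astype(str)  # 'all.csv' as in the demo above
    # ship the DataFrame to every worker once, not once per task
    df_future = client.scatter(df, broadcast=True)
    futures = client.map(find_word, ['course', 'grant'], df=df_future)
    for fut in futures:
        word, count = fut.result()
        print(word, count)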
Scale your workflows like never before
For the above use case, the Pandas library was the better fit. It comes with a few prerequisites: the file size must be 8-100 MB, the number of rows should be less than a million, and it should be a single-page application. In return, we did not have to invest time in loading files into Dask or keeping Dask workers alive.
If you work with Excel files or other file-handling operations and face roadblocks, or want to understand the preferred approach, email us at contact@opcito.com, and an expert will provide end-to-end support to run your Excel operations more efficiently.