Optimizing large Excel files with Pandas & Dask
Excel is still a "go-to" choice for tabular calculations on small, medium, and even large datasets. However, it becomes tedious once a dataset grows too large – say a million rows and hundreds of columns – as Excel tends to slow down, hang, or crash. Handling datasets of that size is not easy for Excel alone, and that's where Python comes into the picture. Python helps users overcome Excel's shortcomings and work with far more data than Excel can comfortably handle.
While working with a client, we needed to optimize Excel file operations to accommodate large amounts of data. Their existing dashboard worked well, but its sheer size caused delays and lag for end users: uploading files or doing simple tasks like searching for keywords would take one to two minutes. Our solution was to streamline their file operations using Python. There are many Python libraries for working with Excel – for example openpyxl, xlsxwriter, pyxlsb, and xlrd, to name a few. We chose Pandas (the Python Data Analysis Library) and Dask because they were the most suitable for this case.
Here is what I am planning to cover in this article:
- File read operations using Pandas & Dask
- Keyword searching operations using Pandas & Dask
- My observations about Pandas & Dask
Let’s discuss them in detail.
File read operations
In this case, the file needs to be read before the data can be processed, so optimizing file read operations saves a significant amount of time. Users can read file contents with Python's built-in functions and then pass the data to libraries like Pandas and Dask, but that takes far longer than using those libraries' built-in readers.
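To illustrate the difference, here is a minimal sketch ('data.csv' is a placeholder file name, and exact timings depend on your data) comparing pandas' built-in reader against reading the file manually and handing the parsed rows to pandas:

import time
import pandas as pd

# built-in reader: parsing happens in pandas' optimized engine
start = time.time()
df = pd.read_csv('data.csv')
print(f"pd.read_csv: {time.time() - start:.2f}s")

# manual read: load the text in Python, split it naively (ignores quoting),
# then build the DataFrame from the parsed rows
start = time.time()
with open('data.csv') as f:
    rows = [line.rstrip('\n').split(',') for line in f]
df2 = pd.DataFrame(rows[1:], columns=rows[0])
print(f"manual parse: {time.time() - start:.2f}s")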
The screenshots below display the time taken to read Excel/CSV files using Pandas and Dask libraries.
Fig: Pandas reading Excel and CSV files timing
Fig: Dask reading Excel and CSV files timing
Both libraries take a similar amount of time to read Excel files, while Dask is much faster than Pandas with CSV files. Note that Dask has no native Excel reader, so reading Excel files in Dask still has to go through a Pandas DataFrame.
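The usual workaround for Excel files is to wrap pandas' reader in dask.delayed and build a Dask DataFrame from it. A minimal sketch (the file names are placeholders):

import pandas as pd
import dask.dataframe as dd
from dask.delayed import delayed

# CSV: Dask has a native, parallel reader
csv_df = dd.read_csv('data.csv')

# Excel: no native Dask reader, so delegate to pandas via dask.delayed
excel_part = delayed(pd.read_excel)('data.xlsx')  # needs openpyxl for .xlsx
excel_df = dd.from_delayed(excel_part)
print(len(excel_df))  # forces the delayed read to actually run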
Reading a CSV file takes less time than reading the equivalent XLS file – in our tests it saved 10-15 seconds without affecting or modifying the data types. This can be used to optimize computations, for example by creating a background job that converts an Excel file to CSV as soon as the user uploads it.
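As a rough sketch of that background job (the function and file names here are illustrative, not taken from the client project), the conversion itself is straightforward with pandas:

import pandas as pd

def convert_upload_to_csv(xlsx_path: str, csv_path: str) -> str:
    """Convert an uploaded Excel file to CSV so later reads are faster."""
    df = pd.read_excel(xlsx_path)   # requires openpyxl for .xlsx files
    df.to_csv(csv_path, index=False)
    return csv_path

# e.g. enqueue convert_upload_to_csv('upload.xlsx', 'upload.csv')
# in a task queue right after the upload completes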
Keyword searching
Let's look at the demo code used for testing. The file is read into a global variable so that the DataFrame can be shared with the multiprocessing workers.
Pandas:
import pandas as pd
from pandasql import sqldf
import multiprocessing
import time

print("starting...")
fstart = time.time()
df = pd.read_csv('testfile22.csv')
print(f"File loading time: {str(time.time() - fstart)}")

columns = df.columns
labels = ['row_num', 'Date', 'Headline', 'URL', 'Hit Sentence', 'Influencer']
# normalize the search column: cast to string first, then lowercase
df['Hit Sentence'] = df['Hit Sentence'].astype(str).str.lower()
df['row_num'] = range(1, len(df) + 1)
df = df.astype(str)


def find_word(word, selected_col='Hit Sentence'):
    # pandasql runs a SQLite query against the in-memory DataFrame
    fwdf = sqldf(f'''SELECT row_num, Date, Headline, URL, Influencer, "{selected_col}"
                     FROM df WHERE "{selected_col}" LIKE "% {word} %"''')
    # count how many times the keyword occurs in each matching row
    fwdf['word_count'] = fwdf[selected_col].str.count(word)
    return fwdf


if __name__ == '__main__':
    search_words = [
        'career courses', 'continue education', 'continued education', 'continues education',
        'course', 'courses', 'coursework', 'educational program', 'Google scholarship',
        'grant', 'grants', 'hybrid learning', 'in-house education', 'in-person programming',
        'job certification', 'job certification program', 'job certification programs',
        'job program', 'job programs', 'lifelong education', 'lifelong learning',
        'ongoing education', 'online program', 'online seminar', 'online teaching',
        'orientation', 'orientations', 'Pell grant', 'Pell grants', 'scholarship',
        'self-help', 'self-help tools', 'skills course', 'work grant', 'work grants',
        'advice', 'apprentice', 'apprenticeship', 'apprenticeships ', 'apprenticeship program',
        'coach', 'coached', 'coaches', 'coaching', 'counsel ', 'counseling', 'remote coaching',
        'feedback', 'gain experience', 'guide', 'guidance', 'guiding', 'instruct', 'instruction',
        'invest', 'invested', 'investing', 'investment', 'investments', 'invest in',
        'co-invest', 'co-invested', 'co-investing', 'mentor', 'mentors ', 'mentoring',
        'mentorship ', 'assure', 'assured', 'assurances', 'balance', 'balanced ', 'balancing ',
        'before personal needs', 'both mom and teaching assistant',
        'come in for only part of the week', 'come in for part of the week', 'comfortable',
        'comfort of my home', 'complacent', 'fewer hours', 'harmony', 'harmonious', 'hobby',
        'hobbies', 'juggle', 'juggling', 'lifestyle', 'manage', 'managing',
        'more time for hobbies', 'nutrition', 'personal life', 'personal time',
        'personal priorities', 'quality of life', 'reduce hours', 'reduced hours',
        'reducing hours', 'shorter hours'
    ]
    # one worker per CPU core; each worker re-imports this module and
    # therefore loads its own copy of the global DataFrame
    pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())
    start_cal = time.time()
    res = pool.map(find_word, search_words)
    print(f"Word calculation took: {str(time.time() - start_cal)}")
    pool.close()
    print("finished...")
Dask:
# pip install "dask[dataframe]"
# pip install dask-sql
import time
import json
import dask.dataframe as dd
from distributed import Client
from dask.delayed import delayed
import pandas as pd

print("starting....")
fstart = time.time()
# wrap the pandas reader in dask.delayed and build a Dask DataFrame from it
parts = delayed(pd.read_csv)("all.csv")
df = dd.from_delayed(parts)
print(len(df))
print(f"File loading time: {str(time.time() - fstart)}")
df.head()  # check the first rows
df = df.astype(str)
columns = df.columns
result_dict = {}


def find_word(word, selected_col='Hit Sentence'):
    print(word)
    # match the keyword surrounded by spaces, or followed by a hyphen
    fwdf = df[df[selected_col].str.contains(f" {word} | {word}-")]
    fwdf['word_count'] = fwdf[selected_col].str.count(f" {word} ")
    fwdf['word_count2'] = fwdf[selected_col].str.count(f" {word}-")
    fwdf['word_count_total'] = fwdf['word_count'] + fwdf['word_count2']
    labels = ['Date', 'Headline', 'URL', 'Hit Sentence', 'Influencer',
              'word_count', 'word_count2', 'word_count_total']
    return word, fwdf[labels]


if __name__ == '__main__':
    client = Client()
    search_words = [
        "air", "stop", "good", "job", "hospital", "covid", "career courses",
        "continue education", "continued education", "continues education", "course",
        "courses", "coursework", "educational program", "Google scholarship", "grant",
        "grants", "hybrid learning", "job certification", "job certification program",
        "job certification programs", "job program", "job programs", "lifelong education",
        "lifelong learning", "ongoing education", "online program", "online seminar",
        "online teaching", "orientation", "orientations", "Pell grant", "Pell grants",
        "scholarship", "skills course", "work grant", "work grants", "advice", "apprentice",
        "apprenticeship", "apprenticeships ", "apprenticeship program", "coach", "coached",
        "coaches", "coaching", "counsel ", "counseling", "remote coaching", "feedback",
        "gain experience", "guide", "guidance", "guiding", "instruct", "instruction",
        "invest", "invested", "investing", "investment", "investments", "invest in",
        "mentor", "mentors ", "mentoring", "mentorship ", "assure", "assured", "assurances",
        "balance", "balanced ", "balancing ", "before personal needs",
        "both mom and teaching assistant", "come in for only part of the week",
        "come in for part of the week", "comfortable", "comfort of my home", "complacent",
        "fewer hours", "harmony", "harmonious", "hobby", "hobbies", "juggle", "juggling",
        "lifestyle", "manage", "managing", "more time for hobbies", "nutrition",
        "personal life", "personal time", "personal priorities", "quality of life",
        "reduce hours", "reduced hours", "reducing hours", "shorter hours",
        # "co-invest", "co-invested", "co-investing",
        "self-help", "self-help tools", "in-house education", "in-person programming"
    ]
    # the hard-coded list above is replaced by keywords loaded from a file
    file_data = json.loads(open('allwords.json', 'r').read())
    search_words = file_data['keywords']
    selected_col = 'Hit Sentence'
    print(f"search keywords len: {str(len(search_words))}")
    cstart = time.time()
    # submit one task per keyword to the Dask cluster
    x = client.map(find_word, search_words)
    res_dict = {}
    for val in x:
        w, rdf = val.result()
        res_dict[w] = rdf
    print(f"Word calculation time: {str(time.time() - cstart)}")
    print(len(res_dict))
    print("end....")
To search for keywords, the user provides the column name to search in and a list of keywords, which may number 100, 500, 1,000, 2,500, and so on. For example, suppose the user wants to search for 1,000 keywords in the selected column and return the result to the dashboard. The result contains the selected columns, the rows containing the keywords, and an extra column with the number of keyword occurrences per row. The computation below was done on a machine with 4 CPUs and 16 GB of RAM.
Fig: Using Pandas to find 996 keywords on 40k rows
Fig: Using Pandas to find 2644 keywords on 40k rows
Fig: Using Dask to find 996 keywords on 40k rows
Fig: Using Dask to find 2644 keywords on 40k rows
Observations
The following observations were made on a machine with 8 CPUs and 16 GB of RAM.
Fig: Pandas 996 Keywords with 40k rows
Fig: Pandas 2644 keywords with 40k rows
Fig: Dask 996 Keywords with 40k rows
Fig: Dask 2644 Keywords with 40k rows
After trying combinations of Pandas and Dask on files of different sizes and row counts, we found that Pandas with multiprocessing is a great fit for files below 500 MB or 1 million rows, since it operates on a single in-memory DataFrame.
Use Dask only if the file size is more than 1 GB and the data has more than 1 million rows, because Dask workers take at least 15-20 seconds just to copy the data from the main program to the workers before they can run any compute logic on it. In short, Dask works very well in the case of extensive data.
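One way to reduce that per-task copy cost is to scatter the DataFrame to all workers once with client.scatter(..., broadcast=True) and pass the resulting future into each task. A sketch under our assumptions (simplified find_word; not part of the original demo):

import pandas as pd
from distributed import Client

def find_word(word, df, selected_col='Hit Sentence'):
    # df is the worker-local copy of the scattered DataFrame
    hits = df[df[selected_col].str.contains(f" {word} ", regex=False)]
    return word, len(hits)

if __name__ == '__main__':
    client = Client()
    df = pd.read_csv('all.csv').astype(str)  # 'all.csv' as in the demo above
    # ship the DataFrame to every worker once, not once per task
    df_future = client.scatter(df, broadcast=True)
    futures = client.map(find_word, ['course', 'grant'], df=df_future)
    for fut in futures:
        word, count = fut.result()
        print(word, count)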
Scale your workflows like never before
For the above use case, the Pandas library was the better fit. It comes with a few prerequisites: the file size must be 8-100 MB, the number of rows should be less than a million, and it should be a single-page application. In return, we did not have to invest time in loading files into Dask or keeping Dask workers alive.
If you work with Excel files or other file-handling operations and face roadblocks, or want to understand the preferred approach, email us at contact@opcito.com, and an expert will provide end-to-end support to run your Excel operations more efficiently.