Contents
- 1 25 Useful Python Commands for Excel
- 1.1 1. Opening and Loading Workbooks
- 1.2 2. Reading Specific Sheets
- 1.3 3. Iterating Through Rows
- 1.4 4. Writing Data to Cells
- 1.5 5. Data Validation
- 1.6 6. Conditional Formatting
- 1.7 7. Creating Charts
- 1.8 8. Merging Cells
- 1.9 9. Adding Formulas
- 1.10 10. Hiding Rows/Columns
- 1.11 11. Protecting Sheets
- 1.12 12. Auto-width Adjustment
- 1.13 13. Filtering Data
- 1.14 14. Pivot Tables
- 1.15 15. Importing/Exporting JSON Data
- 1.16 16. Applying Styles
- 1.17 17. Handling Missing Data
- 1.18 18. Automating Excel Tasks
- 1.19 19. Grouping Data
- 1.20 20. Importing CSV to Excel
- 1.21 21. Splitting Columns
- 1.22 22. Calculating Statistics
- 1.23 23. Creating New Sheets
- 1.24 24. Extracting Data Ranges
- 1.25 25. Dynamic Column Names
25 Useful Python Commands for Excel
Master Excel with 25 useful Python commands. This guide offers practical tips for DIYers looking to optimize their spreadsheets. Enjoy coding!
1. Opening and Loading Workbooks
To open and load workbooks in Python using openpyxl and pandas:
With openpyxl:
from openpyxl import load_workbook
workbook = load_workbook(filename="your-file.xlsx")
sheet = workbook.active # or sheet = workbook["Sheet1"]
With pandas:
import pandas as pd
df = pd.read_excel("your-file.xlsx", sheet_name="Sheet1")
For multiple sheets:
all_sheets = pd.read_excel("your-file.xlsx", sheet_name=None)
For large files, use openpyxl's read-only mode, or read a limited block of rows with pandas (read_excel has no chunksize parameter):
workbook = load_workbook(filename="your-file.xlsx", read_only=True)
# Or with pandas: read only the first 1,000 data rows
df = pd.read_excel("your-file.xlsx", sheet_name="Sheet1", nrows=1000)
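Read-only mode streams rows instead of loading the whole sheet, so a large workbook can be processed in fixed-size batches. The following is a minimal sketch, assuming a hypothetical helper named iter_row_batches and a tiny in-memory workbook standing in for your-file.xlsx:

```python
from io import BytesIO
from itertools import islice

from openpyxl import Workbook, load_workbook


def iter_row_batches(source, batch_size=1000):
    """Yield lists of row tuples from the first sheet, batch_size rows at a time."""
    workbook = load_workbook(source, read_only=True)
    try:
        sheet = workbook.worksheets[0]
        rows = sheet.iter_rows(values_only=True)
        while True:
            batch = list(islice(rows, batch_size))
            if not batch:
                break
            yield batch
    finally:
        # Read-only workbooks keep the file handle open until closed.
        workbook.close()


# Build a small workbook in memory so the sketch runs without a file on disk.
demo = Workbook()
demo.active.append(["id", "value"])
for i in range(1, 8):
    demo.active.append([i, i * 10])
buffer = BytesIO()
demo.save(buffer)
buffer.seek(0)

batches = list(iter_row_batches(buffer, batch_size=3))
for batch in batches:
    print(len(batch), "rows")
```

In practice you would pass a file path instead of the buffer and replace the print call with your own processing step.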
2. Reading Specific Sheets
To access specific sheets in an Excel workbook:
Using openpyxl:
from openpyxl import load_workbook
workbook = load_workbook(filename="your-file.xlsx")
sheet = workbook["Sheet2"]
# Or by index
sheet_name = workbook.sheetnames[1]
sheet = workbook[sheet_name]
Using pandas:
import pandas as pd
df = pd.read_excel("your-file.xlsx", sheet_name="Sheet2")
# Or by index
df = pd.read_excel("your-file.xlsx", sheet_name=1)
# Load all sheets
all_sheets = pd.read_excel("your-file.xlsx", sheet_name=None)
df = all_sheets["Sheet2"]
3. Iterating Through Rows
To iterate through rows in Excel:
Using openpyxl:
from openpyxl import load_workbook
workbook = load_workbook(filename="your-file.xlsx")
sheet = workbook.active
for row in sheet.iter_rows(min_row=1, max_col=3, max_row=2, values_only=True):
    print(row)
Using pandas:
import pandas as pd
df = pd.read_excel("your-file.xlsx", sheet_name="Sheet1")
for index, row in df.iterrows():
    print(index, row["Column1"], row["Column2"])
# For better performance
for row in df.itertuples(index=False):
    print(row.Column1, row.Column2)
# For large datasets, read the sheet in blocks with skiprows and nrows
# (read_excel has no chunksize parameter)
chunk_size = 1000
start = 0
while True:
    chunk = pd.read_excel("your-file.xlsx", sheet_name="Sheet1", skiprows=range(1, start + 1), nrows=chunk_size)
    if chunk.empty:
        break
    for index, row in chunk.iterrows():
        print(index, row["Column1"], row["Column2"])
    start += chunk_size
Manipulating Cell Data:
With openpyxl:
sheet["A1"] = "New Value"
workbook.save("your-file.xlsx")
# Batch operation
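for row in sheet.iter_rows(min_row=2, max_row=10, min_col=1, max_col=3):
    for cell in row:
        # Skip empty cells and text so the multiplication cannot fail
        if isinstance(cell.value, (int, float)):
            cell.value = cell.value * 2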
workbook.save("your-file.xlsx")
With pandas:
df["Column1"] = df["Column1"].apply(lambda x: x * 2)
df.to_excel("your-file_modified.xlsx", index=False)
# Or iteratively
for index, row in df.iterrows():
    df.at[index, "Column1"] = row["Column1"] * 2
df.to_excel("your-file_modified.xlsx", index=False)
For cell formatting with openpyxl:
from openpyxl.styles import Font, PatternFill
cell = sheet["A1"]
cell.font = Font(size=14, bold=True)
cell.fill = PatternFill(start_color="FFFF00", end_color="FFFF00", fill_type="solid")
workbook.save("your-file.xlsx")
4. Writing Data to Cells
To write data to cells in Excel:
Using openpyxl:
from openpyxl import load_workbook
workbook = load_workbook(filename="your-file.xlsx")
sheet = workbook.active
sheet.cell(row=1, column=2, value="Inserted Data")
workbook.save("your-file.xlsx")
# Append rows
new_data = ["A2", "B2", "C2"]
sheet.append(new_data)
workbook.save("your-file.xlsx")
# Dynamic updates
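for row in range(2, sheet.max_row + 1):
    cell_value = sheet.cell(row=row, column=2).value
    # Only double numeric values; empty or text cells are left alone
    if isinstance(cell_value, (int, float)):
        sheet.cell(row=row, column=2, value=cell_value * 2)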
workbook.save("your-file.xlsx")
Using pandas:
import pandas as pd
data = {'Column1': [10, 20], 'Column2': [30, 40]}
df = pd.DataFrame(data)
df.to_excel("your-file_modified.xlsx", index=False)
# Batch updates
df["Column2"] = df["Column2"] * 2
df.to_excel("your-file_modified.xlsx", index=False)
5. Data Validation
To implement data validation in Excel using openpyxl:
from openpyxl import load_workbook
from openpyxl.worksheet.datavalidation import DataValidation
workbook = load_workbook(filename="your-file.xlsx")
sheet = workbook.active
# List validation (the in-cell dropdown is shown by default; in openpyxl, showDropDown=True hides it)
dv = DataValidation(type="list", formula1='"Option1,Option2,Option3"')
dv.add('A1:A10')
sheet.add_data_validation(dv)
# Whole number range validation
dv = DataValidation(type="whole", operator="between", formula1=1, formula2=10)
dv.add('B1:B10')
sheet.add_data_validation(dv)
# Text length validation
dv = DataValidation(type="textLength", operator="lessThanOrEqual", formula1=10)
dv.add('C1:C10')
sheet.add_data_validation(dv)
workbook.save("your-file.xlsx")
These validations help maintain data integrity by restricting input to predefined criteria.
6. Conditional Formatting
Conditional formatting applies cell styles automatically based on cell values, improving Excel spreadsheet readability. Python’s openpyxl library supports conditional formatting through the rule classes in its openpyxl.formatting.rule module.
To get started:
from openpyxl import load_workbook
from openpyxl.formatting.rule import FormulaRule
from openpyxl.styles import PatternFill, Font
workbook = load_workbook(filename="your-file.xlsx")
sheet = workbook.active
Apply a simple conditional formatting rule:
green_fill = PatternFill(start_color="00FF00", end_color="00FF00", fill_type="solid")
rule = FormulaRule(formula=["A1>100"], fill=green_fill)
sheet.conditional_formatting.add('A1:A10', rule)
workbook.save("your-file.xlsx")
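This rule fills the cells in A1:A10 whose values are greater than 100 with a green background. The formula is written for the top-left cell of the range (A1) and is adjusted for each cell below it.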
For more advanced formatting:
green_fill = PatternFill(start_color="00FF00", end_color="00FF00", fill_type="solid")
rule1 = FormulaRule(formula=["A1>100"], fill=green_fill)
red_fill = PatternFill(start_color="FF0000", end_color="FF0000", fill_type="solid")
bold_font = Font(bold=True, color="FFFFFF")
rule2 = FormulaRule(formula=["A1<50"], font=bold_font, fill=red_fill)
sheet.conditional_formatting.add('A1:A10', rule1)
sheet.conditional_formatting.add('A1:A10', rule2)
workbook.save("your-file.xlsx")
This example applies different rules based on cell values, enabling more nuanced data presentations.
Conditional formatting in openpyxl can be customized to fit various needs, from highlighting specific cells to creating data bars or using complex formulas. By integrating these techniques, your Excel files will convey data more effectively and ensure critical values stand out.
7. Creating Charts
Charts and graphs can dramatically improve the understandability of your Excel spreadsheets. Python libraries like openpyxl and pandas, combined with matplotlib, offer powerful tools for generating visual representations of your data.
Using openpyxl to create a bar chart:
from openpyxl import Workbook
from openpyxl.chart import BarChart, Reference
workbook = Workbook()
sheet = workbook.active
data = [
['Item', 'Value'],
['Item A', 30],
['Item B', 60],
['Item C', 90]
]
for row in data:
    sheet.append(row)
chart = BarChart()
values = Reference(sheet, min_col=2, min_row=1, max_col=2, max_row=4)
categories = Reference(sheet, min_col=1, min_row=2, max_row=4)
chart.add_data(values, titles_from_data=True)
chart.set_categories(categories)
chart.title = "Sample Bar Chart"
chart.x_axis.title = "Items"
chart.y_axis.title = "Values"
sheet.add_chart(chart, "E5")
workbook.save("chart.xlsx")
Using pandas with matplotlib for more flexibility:
import pandas as pd
import matplotlib.pyplot as plt
data = {
'Item': ['Item A', 'Item B', 'Item C'],
'Value': [30, 60, 90]
}
df = pd.DataFrame(data)
df.plot(kind='bar', x='Item', y='Value', title='Sample Bar Chart')
plt.xlabel('Items')
plt.ylabel('Values')
plt.savefig("pandas_chart.png")
For a pie chart using openpyxl:
from openpyxl.chart import PieChart
chart = PieChart()
labels = Reference(sheet, min_col=1, min_row=2, max_row=4)
data = Reference(sheet, min_col=2, min_row=1, max_row=4)
chart.add_data(data, titles_from_data=True)
chart.set_categories(labels)
chart.title = "Sample Pie Chart"
sheet.add_chart(chart, "E15")
workbook.save("pie_chart.xlsx")
These libraries allow you to transform raw data into insightful visualizations efficiently, enhancing reports, dashboards, and data-driven documents.
8. Merging Cells
Merging cells can improve the readability of your Excel spreadsheets, for example by giving several columns a shared header. Python’s openpyxl library provides a straightforward way to merge cells using the merge_cells() method.
To start:
from openpyxl import load_workbook
workbook = load_workbook(filename="your-file.xlsx")
sheet = workbook.active
Merging cells A1 to C1:
sheet.merge_cells('A1:C1')
sheet['A1'] = "Merged Header"
workbook.save("your-file.xlsx")
To unmerge cells:
sheet.unmerge_cells('A1:C1')
workbook.save("your-file.xlsx")
Merging a block of cells:
sheet.merge_cells('A1:C3')
sheet['A1'] = "Merged Block"
workbook.save("your-file.xlsx")
Styling merged cells:
from openpyxl.styles import Font, PatternFill
sheet['A1'].font = Font(size=14, bold=True)
sheet['A1'].fill = PatternFill(start_color='FFDD00', end_color='FFDD00', fill_type='solid')
workbook.save("your-file.xlsx")
These techniques can enhance the layout and presentation of your Excel files, making them more organized and easier to read.
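One detail worth seeing in code: after a merge, only the top-left cell keeps its value, and a merged header usually looks better centered. The sketch below uses an in-memory workbook and made-up cell contents; the Alignment style is an addition not shown in the steps above.

```python
from openpyxl import Workbook
from openpyxl.styles import Alignment

workbook = Workbook()
sheet = workbook.active

sheet["A1"] = "Quarterly Report"
sheet["B1"] = "this value is discarded by the merge"

# Merge A1:C1; openpyxl keeps only the top-left value.
sheet.merge_cells("A1:C1")
sheet["A1"].alignment = Alignment(horizontal="center", vertical="center")

# List the merged ranges currently defined on the sheet.
merged_ranges = [cell_range.coord for cell_range in sheet.merged_cells.ranges]
print(merged_ranges)
print(sheet["B1"].value)
```

Because the other cells lose their contents, copy anything you need out of them before merging.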
9. Adding Formulas
Incorporating formulas into Excel cells allows for dynamic calculations that update automatically as data changes. Python makes it straightforward to insert and manage these formulas programmatically.
Using openpyxl to insert formulas:
from openpyxl import load_workbook
workbook = load_workbook(filename="your-file.xlsx")
sheet = workbook.active
sheet["D1"] = "=SUM(A1:C1)"
sheet["E1"] = "=AVERAGE(A1:A10)"
workbook.save("your-file.xlsx")
Using pandas with formulas:
import pandas as pd
df = pd.read_excel("your-file.xlsx", sheet_name="Sheet1")
with pd.ExcelWriter("your-file_with_formulas.xlsx", engine="openpyxl") as writer:
    df.to_excel(writer, sheet_name="Sheet1", index=False)
    workbook = writer.book
    sheet = workbook["Sheet1"]
    sheet["D1"] = "=SUM(A1:C1)"
    sheet["E1"] = "=AVERAGE(A1:A10)"
# The file is saved automatically when the with block exits
More complex formulas:
sheet["F1"] = "=VLOOKUP(A1, B1:C10, 2, FALSE)"
# Excel requires double quotes around text inside formulas
sheet["G1"] = '=IF(A1>50, "Pass", "Fail")'
workbook.save("your-file.xlsx")
By integrating formulas, you automate calculations and logical operations within your Excel sheets, ensuring they dynamically respond to data changes. This enhances the interactivity and analytical depth of your spreadsheets.
Common Excel Formulas
- SUM: Adds up a range of cells
- AVERAGE: Calculates the mean of a range of cells
- COUNT: Counts the number of cells containing numbers
- VLOOKUP: Searches for a value in a table and returns a corresponding value
- IF: Performs a logical test and returns different values based on the result
These formulas are just the tip of the iceberg. Excel offers a vast array of functions for financial analysis, statistical calculations, and data manipulation that can be leveraged through Python.
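Keep in mind that openpyxl stores a formula as text and does not calculate it; the result appears once Excel (or another spreadsheet application) opens and recalculates the file. A small sketch, using an in-memory workbook as a stand-in for a real file, shows what each loading mode returns:

```python
from io import BytesIO

from openpyxl import Workbook, load_workbook

workbook = Workbook()
sheet = workbook.active
sheet.append([1, 2, 3])
sheet["D1"] = "=SUM(A1:C1)"

buffer = BytesIO()
workbook.save(buffer)

# Default load: the cell holds the formula text.
buffer.seek(0)
formula_view = load_workbook(buffer).active["D1"].value

# data_only=True returns the cached result, which is missing here
# because no spreadsheet application has calculated the file yet.
buffer.seek(0)
cached_view = load_workbook(buffer, data_only=True).active["D1"].value

print(formula_view)
print(cached_view)
```

If you need the computed numbers inside Python, calculate them with pandas or plain Python rather than relying on the cached values.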
10. Hiding Rows/Columns
Hiding rows or columns in Excel can simplify your view, making the spreadsheet more manageable. Openpyxl allows you to programmatically hide rows or columns.
To begin, load your workbook and select the active sheet:
from openpyxl import load_workbook
workbook = load_workbook(filename="your-file.xlsx")
sheet = workbook.active
Hiding Columns
To hide a specific column, set the hidden attribute of its column dimension:
# Hide column B
sheet.column_dimensions['B'].hidden = True
workbook.save("your-file.xlsx")
You can hide multiple columns by repeating the process:
# Hide columns B and D
sheet.column_dimensions['B'].hidden = True
sheet.column_dimensions['D'].hidden = True
workbook.save("your-file.xlsx")
Hiding Rows
To hide rows, use the row_dimensions attribute:
# Hide row 3
sheet.row_dimensions[3].hidden = True
workbook.save("your-file.xlsx")
For multiple rows:
# Hide rows 3 and 5
sheet.row_dimensions[3].hidden = True
sheet.row_dimensions[5].hidden = True
workbook.save("your-file.xlsx")
Combining Row and Column Hiding
You can hide both rows and columns together:
# Hide column B and rows 3 to 5
sheet.column_dimensions['B'].hidden = True
for i in range(3, 6):
    sheet.row_dimensions[i].hidden = True
workbook.save("your-file.xlsx")
Unhiding Rows and Columns
To make hidden rows or columns visible again, set the hidden attribute to False:
# Unhide column B and rows 3 to 5
sheet.column_dimensions['B'].hidden = False
for i in range(3, 6):
    sheet.row_dimensions[i].hidden = False
workbook.save("your-file.xlsx")
Using these techniques, you can create clean, professional spreadsheets tailored to your audience’s needs.
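Column letters shift whenever a sheet layout changes, so it can be more robust to hide columns by their header text. The sketch below assumes headers sit in row 1; hide_columns_by_header is a hypothetical helper, not an openpyxl function.

```python
from openpyxl import Workbook
from openpyxl.utils import get_column_letter


def hide_columns_by_header(sheet, headers, header_row=1):
    """Hide every column whose header-row value is in headers; return the letters hidden."""
    hidden = []
    for cell in sheet[header_row]:
        if cell.value in headers:
            letter = get_column_letter(cell.column)
            sheet.column_dimensions[letter].hidden = True
            hidden.append(letter)
    return hidden


workbook = Workbook()
sheet = workbook.active
sheet.append(["Name", "Internal ID", "Email", "Notes"])
sheet.append(["Ana", 1001, "ana@example.com", "n/a"])

hidden = hide_columns_by_header(sheet, {"Internal ID", "Notes"})
print(hidden)
```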
11. Protecting Sheets
Protecting Excel sheets can ensure data integrity and prevent unauthorized edits. Openpyxl provides methods to protect worksheets and specific ranges.
To start, load your workbook and activate the sheet:
from openpyxl import load_workbook
workbook = load_workbook(filename="your-file.xlsx")
sheet = workbook.active
Locking Entire Sheets
To lock an entire sheet with a password:
sheet.protection.sheet = True
sheet.protection.password = 'secure_password'
workbook.save("your-file.xlsx")
Customizing Protection Options
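You can adjust protection settings to allow certain actions while restricting others. In openpyxl, a flag set to True keeps that action locked, and False permits it:
sheet.protection.enable()
sheet.protection.sort = True  # sorting stays locked
sheet.protection.formatCells = True  # cell formatting stays locked
sheet.protection.insertRows = False  # inserting rows is permitted
sheet.protection.deleteColumns = False  # deleting columns is permitted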
workbook.save("your-file.xlsx")
Locking Specific Cells
To protect particular cells or ranges:
from openpyxl.styles import Protection
# Unlock all cells
for row in sheet.iter_rows():
    for cell in row:
        cell.protection = Protection(locked=False)
# Lock cells in the range A1 to C1
for row in sheet.iter_rows(min_row=1, max_row=1, min_col=1, max_col=3):
    for cell in row:
        cell.protection = Protection(locked=True)
sheet.protection.enable()
sheet.protection.password = 'secure_password'
workbook.save("your-file.xlsx")
Advanced Protection Customization
For non-contiguous ranges or different protection settings:
# Unlock all cells first
for row in sheet.iter_rows():
    for cell in row:
        cell.protection = Protection(locked=False)
# Protect specific ranges
for row in sheet.iter_rows(min_row=1, max_row=1, min_col=1, max_col=3):
    for cell in row:
        cell.protection = Protection(locked=True)
for row in sheet.iter_rows(min_row=3, max_row=5, min_col=2, max_col=4):
    for cell in row:
        cell.protection = Protection(locked=True)
sheet.protection.enable()
sheet.protection.password = 'secure_password'
workbook.save("your-file.xlsx")
These protection features help maintain data integrity, especially in collaborative environments or when sharing sensitive information.
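The unlock-then-lock pattern repeats for every range, so it can be wrapped in one function. This is only a sketch: lock_only is a hypothetical helper, the data is made up, and each entry in ranges is assumed to be a multi-cell range such as "A1:C1".

```python
from openpyxl import Workbook
from openpyxl.styles import Protection


def lock_only(sheet, ranges, password=None):
    """Unlock every used cell, then lock just the cells in the given ranges."""
    for row in sheet.iter_rows():
        for cell in row:
            cell.protection = Protection(locked=False)
    for cell_range in ranges:
        # Indexing with "A1:C1" returns a tuple of rows, each a tuple of cells.
        for row in sheet[cell_range]:
            for cell in row:
                cell.protection = Protection(locked=True)
    # Cell-level locks only take effect once sheet protection is enabled.
    sheet.protection.enable()
    if password is not None:
        sheet.protection.password = password


workbook = Workbook()
sheet = workbook.active
for _ in range(5):
    sheet.append([1, 2, 3, 4])

lock_only(sheet, ["A1:C1", "B3:D5"], password="secure_password")
print(sheet["A1"].protection.locked, sheet["A2"].protection.locked)
```

Sheet protection of this kind guards against accidental edits; it is not encryption and should not be treated as a security boundary.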
12. Auto-width Adjustment
Automatically adjusting column widths in Excel can improve readability and appearance. The xlsxwriter library allows for auto-width adjustment during file creation.
First, install xlsxwriter:
pip install xlsxwriter
Here’s an example of how to create a workbook with auto-adjusted column widths:
import xlsxwriter
workbook = xlsxwriter.Workbook('auto_width.xlsx')
worksheet = workbook.add_worksheet()
data = [
['Header1', 'Header2', 'Header3'],
['Short', 'A bit longer text', 'This is the longest piece of text in this row'],
['Tiny', 'Medium length text here', 'Shortest']
]
for row_num, row_data in enumerate(data):
    for col_num, col_data in enumerate(row_data):
        worksheet.write(row_num, col_num, col_data)
for col_num in range(len(data[0])):
    col_width = max(len(str(data[row_num][col_num])) for row_num in range(len(data)))
    worksheet.set_column(col_num, col_num, col_width)
workbook.close()
This script:
- Creates a new workbook and worksheet
- Inserts sample data
- Calculates the maximum content length for each column
- Adjusts column widths accordingly
You can add extra space for better readability:
buffer_space = 2
for col_num in range(len(data[0])):
    col_width = max(len(str(data[row_num][col_num])) for row_num in range(len(data))) + buffer_space
    worksheet.set_column(col_num, col_num, col_width)
Using auto-width adjustment ensures your spreadsheets are functional and visually appealing, enhancing data representation and analysis.
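The width calculation itself does not depend on any Excel library, so it can be kept in a small function and reused with either xlsxwriter or openpyxl. The sketch below is one way to write it; column_widths, padding, and max_width are illustrative names, and the cap on very long cells is an added assumption.

```python
def column_widths(rows, padding=2, max_width=60):
    """Return one width per column: longest cell text plus padding, capped at max_width."""
    column_count = max(len(row) for row in rows)
    widths = [0] * column_count
    for row in rows:
        for col_num, value in enumerate(row):
            # Empty cells count as zero characters.
            text = "" if value is None else str(value)
            widths[col_num] = max(widths[col_num], len(text))
    return [min(width + padding, max_width) for width in widths]


data = [
    ["Header1", "Header2", "Header3"],
    ["Short", "A bit longer text", "This is the longest piece of text in this row"],
    ["Tiny", "Medium length text here", "Shortest"],
]

widths = column_widths(data)
print(widths)
```

Each value can then be passed to worksheet.set_column(col_num, col_num, width). Character counts only approximate rendered width, since the real width depends on the font.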
13. Filtering Data
Filtering data is a useful technique for focusing on specific subsets of your dataset. Python’s pandas library offers capabilities for efficient data filtering, which is helpful for data analysis, preparation, or extraction tasks.
To get started, import pandas and read your Excel file into a DataFrame:
import pandas as pd
df = pd.read_excel("your-file.xlsx", sheet_name="Sheet1")
Common filtering methods:
- Filtering Rows by Column Values
Use boolean indexing to filter rows where a certain column meets specific conditions:
filtered_df = df[df["Age"] > 25]
print(filtered_df)
- Combining Multiple Conditions
Use the logical operators & (and), | (or), and ~ (not) for multiple conditions:
filtered_df = df[(df["Age"] > 25) & (df["Gender"] == "Male")]
print(filtered_df)
- Using query() for Enhanced Readability
The query() method provides a more readable syntax:
filtered_df = df.query("Age > 25 and Gender == 'Male'")
print(filtered_df)
- Filtering Columns
Select specific columns in your resultant DataFrame:
filtered_columns_df = df[["Name", "Age"]]
print(filtered_columns_df)
- Using isin() for Set-based Filtering
Filter based on multiple values in a column:
filtered_df = df[df["City"].isin(["New York", "Los Angeles"])]
print(filtered_df)
- Handling Missing Data
Remove rows with missing values or fill them with a specified value:
clean_df = df.dropna()
filled_df = df.fillna(0)
These methods help you manipulate and extract specific data views from large datasets, enabling more focused analysis and better data management.
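To try these filters without an Excel file, the sketch below builds a small DataFrame in memory. The names and values are invented for illustration, and one age is deliberately missing to show that comparisons treat missing values as not matching.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cara", "Dev"],
    "Age": [31, None, 45, 40],
    "Gender": ["Female", "Male", "Female", "Male"],
    "City": ["New York", "Chicago", "Los Angeles", "New York"],
})

# Boolean indexing: the row with a missing Age is excluded.
over_25 = df[df["Age"] > 25]

# Two conditions combined with &, each wrapped in parentheses where needed.
women_in_target_cities = df[
    (df["Gender"] == "Female") & df["City"].isin(["New York", "Los Angeles"])
]

# The same filter expressed with query().
same_with_query = df.query("Gender == 'Female' and City in ['New York', 'Los Angeles']")

print(over_25["Name"].tolist())
print(women_in_target_cities["Name"].tolist())
```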
14. Pivot Tables
Pivot tables are powerful tools for summarizing large datasets. Python’s pandas library simplifies the creation of pivot tables, allowing you to generate summaries and insights efficiently.
To begin, import pandas and load your Excel file into a DataFrame:
import pandas as pd
df = pd.read_excel("your-file.xlsx", sheet_name="Sheet1")
Creating and Manipulating Pivot Tables:
- Creating a Basic Pivot Table
Use the pivot_table() method to summarize data:
pivot_table = pd.pivot_table(
    df,
    values='Sales',
    index='Region',
    columns='Product Category',
    aggfunc='sum'
)
print(pivot_table)
- Adding Multiple Aggregation Functions
Analyze data using multiple functions at once:
pivot_table = pd.pivot_table(
    df,
    values='Sales',
    index='Region',
    columns='Product Category',
    aggfunc=['sum', 'mean']
)
print(pivot_table)
- Handling Missing Data
Fill in default values for missing data:
pivot_table = pd.pivot_table(
    df,
    values='Sales',
    index='Region',
    columns='Product Category',
    aggfunc='sum',
    fill_value=0
)
print(pivot_table)
- Adding Margins for Totals
Include row and column totals:
pivot_table = pd.pivot_table(
    df,
    values='Sales',
    index='Region',
    columns='Product Category',
    aggfunc='sum',
    margins=True
)
print(pivot_table)
- Using Multiple Indexes
Group data by more than one index:
pivot_table = pd.pivot_table(
    df,
    values='Sales',
    index=['Region', 'Salesperson'],
    columns='Product Category',
    aggfunc='sum'
)
print(pivot_table)
- Visualizing Pivot Tables
Plot pivot tables for visual insights:
import matplotlib.pyplot as plt
pivot_table.plot(kind='bar', figsize=(10, 5))
plt.title('Sales by Region and Product Category')
plt.xlabel('Region')
plt.ylabel('Sales')
plt.show()
By using pandas for pivot tables, you can transform complex datasets into insightful summaries, enhancing your data analysis and reporting capabilities.
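A small worked example makes the effect of fill_value and margins easier to see. The sales figures below are invented, and the column names simply mirror the ones used in this section:

```python
import pandas as pd

df = pd.DataFrame({
    "Region": ["East", "East", "West", "West", "West"],
    "Product Category": ["Books", "Toys", "Books", "Books", "Games"],
    "Sales": [100, 50, 70, 30, 20],
})

pivot = pd.pivot_table(
    df,
    values="Sales",
    index="Region",
    columns="Product Category",
    aggfunc="sum",
    fill_value=0,   # combinations with no sales show 0 instead of NaN
    margins=True,   # adds an "All" row and an "All" column with totals
)
print(pivot)
```

The two West/Books rows are summed into one cell, East/Games is filled with 0, and the "All" corner holds the grand total.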
15. Importing/Exporting JSON Data
Importing and exporting JSON (JavaScript Object Notation) data is useful for modern data handling. Python’s pandas library simplifies the conversion of JSON data into Excel and vice versa.
Importing JSON Data into Excel
Load JSON data into a DataFrame:
import pandas as pd
json_data = pd.read_json("data.json")
print(json_data.head())
For nested JSON data:
normalized_data = pd.json_normalize(json_data['nested_field'])
print(normalized_data.head())
Export to Excel:
json_data.to_excel("data.xlsx", index=False)
Exporting DataFrame to JSON
Load Excel data into a DataFrame:
df = pd.read_excel("data.xlsx")
Convert DataFrame to JSON:
json_str = df.to_json()
with open("data.json", "w") as json_file:
    json_file.write(json_str)
Customizing JSON Output
Generate more readable JSON:
json_str = df.to_json(orient="records", indent=4)
with open("data_pretty.json", "w") as json_file:
    json_file.write(json_str)
Handling Complex Data Structures
For nested data:
nested_df = pd.DataFrame({
"id": [1, 2],
"info": [{"name": "Alice", "age": 25}, {"name": "Bob", "age": 30}]
})
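from io import StringIO
nested_json_str = nested_df.to_json(orient="records", lines=True)
print(nested_json_str)
# Wrap the text in StringIO: recent pandas versions deprecate passing raw JSON strings to read_json
nested_json_df = pd.read_json(StringIO(nested_json_str), lines=True)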
print(nested_json_df)
Integration with Web APIs
Fetch JSON data from web APIs:
import requests
response = requests.get("https://api.sampleendpoint.com/data")
json_data = response.json()
df = pd.json_normalize(json_data)
print(df.head())
df.to_excel("web_data.xlsx", index=False)
Using pandas for importing and exporting JSON data allows for smooth transitions between JSON and Excel formats, enhancing data handling capabilities across different platforms and applications.
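Nested JSON usually has to be flattened before it fits a spreadsheet's rows and columns. The sketch below uses made-up records held in memory, so the variable names and data are purely illustrative:

```python
from io import StringIO

import pandas as pd

records = [
    {"id": 1, "info": {"name": "Alice", "age": 25}},
    {"id": 2, "info": {"name": "Bob", "age": 30}},
]

# json_normalize expands nested keys into dotted column names.
flat = pd.json_normalize(records)
print(list(flat.columns))

# Round trip through JSON text and back into a DataFrame.
json_text = flat.to_json(orient="records")
restored = pd.read_json(StringIO(json_text), orient="records")
print(restored)
```

The flattened frame can be written with flat.to_excel("data.xlsx", index=False), giving one spreadsheet column per nested field.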
16. Applying Styles
Enhancing the visual appeal of Excel spreadsheets can improve readability and user experience. Python’s openpyxl library provides ways to apply styles to cells, including changing fonts, altering cell background colors, and adding borders.
To begin, import the necessary modules and load your workbook:
from openpyxl import load_workbook
from openpyxl.styles import Font, PatternFill, Border, Side
workbook = load_workbook(filename="your-file.xlsx")
sheet = workbook.active
Applying Font Styles
Modify the font properties of a cell using the Font class:
cell = sheet["A1"]
cell.font = Font(size=14, bold=True, color="FF0000") # Red Bold Font, Size 14
sheet["A1"] = "Styled Text"
workbook.save("your-file.xlsx")
Changing Cell Background Colors
Alter the background color of a cell using the PatternFill class:
cell = sheet["B2"]
cell.fill = PatternFill(start_color="FFFF00", end_color="FFFF00", fill_type="solid")
sheet["B2"] = "Highlighted"
workbook.save("your-file.xlsx")
Adding Borders to Cells
Add borders around cells using the Border and Side classes:
thin_border = Border(left=Side(style='thin', color="000000"),
right=Side(style='thin', color="000000"),
top=Side(style='thin', color="000000"),
bottom=Side(style='thin', color="000000"))
cell = sheet["C3"]
cell.border = thin_border
sheet["C3"] = "Bordered Cell"
workbook.save("your-file.xlsx")
Combining Multiple Styles
Combine font styles, background colors, and borders to fully customize a cell:
cell = sheet["D4"]
cell.font = Font(size=12, italic=True, color="0000FF") # Blue Italic Font, Size 12
cell.fill = PatternFill(start_color="FFDDC1", end_color="FFDDC1", fill_type="solid")
cell.border = Border(left=Side(style='thick', color="DD0000"),
right=Side(style='thick', color="DD0000"),
top=Side(style='thick', color="DD0000"),
bottom=Side(style='thick', color="DD0000"))
sheet["D4"] = "Custom Styled"
workbook.save("your-file.xlsx")
Styling Columns and Rows
Apply styles to entire columns or rows:
for cell in sheet["E"]:
    cell.font = Font(bold=True, color="008000") # Green Bold Font
    cell.fill = PatternFill(start_color="D3FFD3", end_color="D3FFD3", fill_type="solid") # Light Green Background
workbook.save("your-file.xlsx")
By using these styling capabilities, you can enhance the aesthetics of your Excel files, making them easier to read and interpret.
17. Handling Missing Data
Working with real-world datasets often involves encountering missing data. Python’s pandas library offers methods such as fillna() and dropna() to manage missing data effectively.
Using the fillna() Method
The fillna() function replaces missing values with a specified value:
import pandas as pd
# Load data into a DataFrame
df = pd.read_excel("your-file.xlsx")
# Fill missing values with a constant value, such as 0
df_filled = df.fillna(0)
print(df_filled.head())
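# Fill missing values in numeric columns with each column's mean
# (numeric_only=True avoids errors when the sheet also contains text columns)
df_filled_mean = df.fillna(df.mean(numeric_only=True))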
print(df_filled_mean.head())
Forward and Backward Fill
Use forward fill (ffill()) and backward fill (bfill()) for more advanced data imputation. Recent pandas versions deprecate the older fillna(method='ffill') spelling in favor of these methods:
# Forward fill: propagate last observed value forward
df_ffill = df.ffill()
print(df_ffill.head())
# Backward fill: propagate next observed value backward
df_bfill = df.bfill()
print(df_bfill.head())
Using the dropna() Method
The dropna() method removes rows or columns with missing data:
# Drop rows with any missing values
df_dropped = df.dropna()
print(df_dropped.head())
# Drop columns with any missing values
df_dropped_columns = df.dropna(axis=1)
print(df_dropped_columns.head())
# Drop rows where all values are missing
df_dropped_all = df.dropna(how='all')
print(df_dropped_all.head())
Handling Incomplete Data with Conditional Drops
Use the subset parameter in dropna() to specify which columns to consider:
# Drop rows if any value in specified columns is missing
df_dropped_subset = df.dropna(subset=['Column1', 'Column2'])
print(df_dropped_subset.head())
Effective handling of missing data is crucial for maintaining the accuracy and reliability of your dataset. These techniques offer the flexibility to prepare your data for analysis.
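The difference between these strategies is easiest to see side by side on a tiny dataset. The example below uses invented product data with gaps in two numeric columns:

```python
import pandas as pd

df = pd.DataFrame({
    "Product": ["A", "B", "C", "D"],
    "Price": [10.0, None, 30.0, None],
    "Stock": [5.0, 7.0, None, 9.0],
})

# Replace gaps with each numeric column's mean (Price mean is 20, Stock mean is 7).
filled_mean = df.fillna(df.mean(numeric_only=True))

# Carry the previous value forward, or the next value backward.
forward = df.ffill()
backward = df.bfill()

# Keep only rows where Price is present.
complete_price = df.dropna(subset=["Price"])

print(filled_mean)
print(forward)
print(backward)
```

Note that backward fill leaves the last Price empty because there is no later value to copy from.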
18. Automating Excel Tasks
Python’s openpyxl and pandas libraries provide tools to script Excel automation, allowing you to streamline workflows and enhance productivity.
Automating Data Insertion
Populate a range of cells with incrementing numbers:
from openpyxl import load_workbook
workbook = load_workbook(filename="your-file.xlsx")
sheet = workbook.active
for i in range(1, 11):
    sheet[f"A{i}"] = i
workbook.save("your-file.xlsx")
Automating Data Manipulation
Use pandas to apply transformations across an entire column:
import pandas as pd
df = pd.read_excel("your-file.xlsx")
df['New_Column'] = df['Existing_Column'] * 2
df.to_excel("your-file_updated.xlsx", index=False)
Automating Conditional Formatting
Apply conditional formatting to cells based on their values:
from openpyxl.formatting.rule import CellIsRule
from openpyxl.styles import PatternFill
workbook = load_workbook(filename="your-file.xlsx")
sheet = workbook.active
red_fill = PatternFill(start_color="FFC7CE", end_color="FFC7CE", fill_type="solid")
rule = CellIsRule(operator="greaterThan", formula=["100"], fill=red_fill)
sheet.conditional_formatting.add('A1:A10', rule)
workbook.save("your-file.xlsx")
Automating Data Validation
Restrict input values in a specific range:
from openpyxl.worksheet.datavalidation import DataValidation
workbook = load_workbook(filename="your-file.xlsx")
sheet = workbook.active
dv = DataValidation(type="whole", operator="between", formula1=1, formula2=10)
dv.error = "Your entry is invalid"
dv.errorTitle = "Invalid Entry"
sheet.add_data_validation(dv)
dv.add('B1:B10')
workbook.save("your-file.xlsx")
Automating Report Generation
Generate Excel reports by integrating data collection, analysis, and presentation:
raw_data = pd.read_excel("raw_data.xlsx")
summary = raw_data.describe()
summary.to_excel("summary_report.xlsx")
Automating Merging Multiple Excel Files
Merge multiple files into a single DataFrame:
import glob
file_list = glob.glob("data_folder/*.xlsx")
# DataFrame.append was removed in pandas 2.0; collect the frames and concatenate once
frames = [pd.read_excel(file) for file in file_list]
all_data = pd.concat(frames, ignore_index=True)
all_data.to_excel("merged_data.xlsx", index=False)
Automating Excel tasks using openpyxl and pandas can save time and ensure consistency across repetitive processes. These libraries provide the tools to transform manual workflows into efficient, automated scripts.
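When merging many files it often helps to record which file each row came from. The sketch below writes two small workbooks to a temporary folder so it can run anywhere; the file names and the "Source File" column are illustrative assumptions.

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as folder:
    # Create two sample workbooks standing in for data_folder/*.xlsx.
    pd.DataFrame({"Sales": [1, 2]}).to_excel(os.path.join(folder, "jan.xlsx"), index=False)
    pd.DataFrame({"Sales": [3]}).to_excel(os.path.join(folder, "feb.xlsx"), index=False)

    frames = []
    # Sort the paths so the merge order is predictable.
    for path in sorted(glob.glob(os.path.join(folder, "*.xlsx"))):
        frame = pd.read_excel(path)
        frame["Source File"] = os.path.basename(path)
        frames.append(frame)

    merged = pd.concat(frames, ignore_index=True)

print(merged)
```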
19. Grouping Data
Grouping Data with groupby()
Pandas’ groupby() function allows you to divide your data based on specific criteria, enabling deeper analysis and revealing trends within different subsets.
Basic Grouping with groupby()
Import pandas and load your dataset:
import pandas as pd
df = pd.read_excel("your-file.xlsx")
Group data by a column:
grouped = df.groupby('Region')
print(grouped.size())
Aggregating Grouped Data
Apply aggregation functions to grouped data:
total_sales_by_region = grouped['Sales'].sum()
average_sales_by_region = grouped['Sales'].mean()
Applying Multiple Aggregations
Use agg() to apply multiple functions:
aggregated_sales = grouped['Sales'].agg(['sum', 'mean', 'max', 'min'])
Grouping by Multiple Columns
Group by multiple columns for more detailed analysis:
# numeric_only=True sums only numeric columns and ignores text columns
grouped_multi = df.groupby(['Region', 'Product Category']).sum(numeric_only=True)
Transform and Filter Operations
Normalize data within groups or filter based on criteria:
df['Normalized Sales'] = grouped['Sales'].transform(lambda x: (x - x.mean()) / x.std())
high_sales_regions = grouped.filter(lambda x: x['Sales'].sum() > 10000)
Using Custom Functions with apply()
Apply custom functions to groups:
def custom_aggregation(group):
    return pd.Series({
        'Total Sales': group['Sales'].sum(),
        'Average Discount': group['Discount'].mean()
    })
custom_grouped = grouped.apply(custom_aggregation)
Saving Grouped Data
Export aggregated data to Excel:
aggregated_sales.to_excel("aggregated_sales.xlsx", index=True)
By using groupby(), you can effectively segment and analyze your data, transforming raw information into meaningful insights for informed decision-making and detailed reporting.
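As an alternative to a custom function with apply(), pandas also supports named aggregation, which tends to be shorter and faster for simple summaries. The sketch below uses invented figures; the output column names are arbitrary choices.

```python
import pandas as pd

df = pd.DataFrame({
    "Region": ["East", "East", "West", "West"],
    "Sales": [100, 300, 50, 150],
    "Discount": [0.1, 0.2, 0.0, 0.3],
})

# Named aggregation: output_column=(input_column, function)
summary = df.groupby("Region").agg(
    total_sales=("Sales", "sum"),
    average_discount=("Discount", "mean"),
)

# transform() returns a value per original row, so it can be used for per-group shares.
df["Share of Region"] = df["Sales"] / df.groupby("Region")["Sales"].transform("sum")

print(summary)
print(df)
```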
20. Importing CSV to Excel
Converting CSV Files to Excel Format Using Pandas
Python’s pandas library offers an efficient way to convert CSV files to Excel format.
Importing CSV Data
import pandas as pd
df = pd.read_csv("your-data.csv")
print(df.head())
Exporting to Excel
df.to_excel("your-data.xlsx", index=False, sheet_name="Sheet1")
Handling CSV Variations
For different delimiters:
df = pd.read_csv("your-data.csv", delimiter=';')
For files without headers:
df = pd.read_csv("your-data.csv", header=None)
df.columns = ["Column1", "Column2", "Column3"]
Handling Large CSV Files
Process large files in chunks:
chunk_size = 1000
chunk_list = []
for chunk in pd.read_csv("your-data.csv", chunksize=chunk_size):
    chunk_list.append(chunk)
df = pd.concat(chunk_list)
df.to_excel("large-data.xlsx", index=False)
Customizing the Excel Output
selected_columns = df[["Column1", "Column3"]]
with pd.ExcelWriter("custom-data.xlsx", engine="xlsxwriter") as writer:
    selected_columns.to_excel(writer, index=False, sheet_name="SelectedData")
    workbook = writer.book
    worksheet = writer.sheets["SelectedData"]
    format1 = workbook.add_format({'num_format': '#,##0.00'})
    worksheet.set_column('A:A', None, format1)
Preserving Data Types
df = pd.read_csv("your-data.csv", dtype={"Column1": float, "Column2": str})
By using pandas to convert CSV files to Excel format, you can efficiently transition from raw data to structured spreadsheets, enhancing data accessibility for analysis and reporting.
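The whole conversion can be tested without touching the disk by reading CSV text from a string and writing the workbook to a memory buffer. The semicolon-delimited sample and its column names below are made up for illustration:

```python
from io import BytesIO, StringIO

import pandas as pd

csv_text = "id;name;amount\n1;Widget;9.5\n2;Gadget;12.25\n"

# Read the CSV with an explicit delimiter and explicit column types.
df = pd.read_csv(
    StringIO(csv_text),
    sep=";",
    dtype={"id": int, "name": str, "amount": float},
)

# Write to an in-memory workbook, then read it back to confirm the contents.
buffer = BytesIO()
df.to_excel(buffer, index=False, sheet_name="Sheet1")
buffer.seek(0)
round_trip = pd.read_excel(buffer, sheet_name="Sheet1")

print(round_trip)
```

Replacing the buffer with a file name such as "your-data.xlsx" gives the on-disk version of the same steps.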
21. Splitting Columns
Pandas’ str.split() method allows you to separate cell contents into multiple columns based on a specified delimiter.
Load your dataset:
import pandas as pd
df = pd.read_excel("your-file.xlsx")
Split a “Full Name” column:
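# n=1 splits on the first space only, so names with more than two words still yield two columns
df[['First Name', 'Last Name']] = df['Full Name'].str.split(' ', n=1, expand=True)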
df.drop(columns=['Full Name'], inplace=True)
df.to_excel("split_columns.xlsx", index=False)
Split a comma-separated column:
df[['Street', 'City', 'State']] = df['Address'].str.split(',', expand=True)
Use regular expressions for complex splitting:
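# Assumes values shaped like "(555) 123-4567" or "555-123-4567";
# str.extract() captures the two parts directly instead of splitting on every delimiter
df[['Area Code', 'Phone Number']] = df['Contact'].str.extract(r'\(?(\d{3})\)?[\s-]*(.+)')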
Split URLs:
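# Use a separate DataFrame so the example does not depend on the length of df
url_df = pd.DataFrame({'URL': ['https://example.com/path/to/page', 'http://another-example.org/home']})
url_parts = url_df['URL'].str.split('/', expand=True)
# Shorter URLs are padded with missing values in the trailing columns
url_parts.columns = ['Protocol', 'Empty', 'Domain', 'Path1', 'Path2', 'Path3']
url_parts = url_parts.drop(columns=['Empty'])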
By using str.split(), you can effectively manage and manipulate data contained within single columns, transforming it into a more usable and structured format. This approach cleans up datasets and facilitates more precise data analysis and reporting.
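Real data rarely splits into the same number of parts on every row. The sketch below, using a small made-up DataFrame, shows how the n argument of str.split() caps the number of splits so that the result always has at most two columns, with missing values where a part is absent.

```python
import pandas as pd

# Made-up sample data with one, two, and three-word names
names = pd.DataFrame({"Full Name": ["Ada Lovelace", "Mary Ann Evans", "Plato"]})

# n=1 splits on the first space only; everything after it stays together
parts = names["Full Name"].str.split(" ", n=1, expand=True)

names["First Name"] = parts[0]
names["Last Name"] = parts[1]  # missing where the name has a single word

print(names)
```

If the last word should be treated as the surname instead, str.rsplit() with the same arguments splits from the right.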
22. Calculating Statistics
Deriving basic statistics such as mean, median, and mode is essential in data analysis. Python’s pandas library offers efficient methods to calculate these statistics.
Calculating Mean
To calculate the mean of a column in your DataFrame:
import pandas as pd
df = pd.read_excel("your-file.xlsx")
mean_value = df['Column_Name'].mean()
print(f"Mean: {mean_value}")
Calculating Median
To compute the median:
median_value = df['Column_Name'].median()
print(f"Median: {median_value}")
Calculating Mode
To determine the mode:
# mode() returns a Series because several values can tie for most frequent
mode_value = df['Column_Name'].mode()
print(f"Mode: {mode_value.tolist()}")
Aggregating Multiple Statistics
For a summary of various statistics:
summary = df.describe()
print(summary)
Custom Aggregation using agg()
For specific statistics:
custom_stats = df.agg({
    'Column_Name': ['mean', 'median', lambda x: x.mode().iloc[0]]
})
print(custom_stats)
Handling NaN Values
To handle missing values:
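# skipna=True is the default, so missing values are excluded from the mean
mean_ignore_nan = df['Column_Name'].mean(skipna=True)
# Filling with 0 keeps those rows in the count, which lowers the mean
mean_fill_nan = df['Column_Name'].fillna(0).mean()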
print(f"Mean ignoring NaN: {mean_ignore_nan}")
print(f"Mean filling NaN with 0: {mean_fill_nan}")
These methods allow you to derive insights from your data efficiently.
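To make the difference between these choices concrete, here is a small worked sketch on a made-up Series with one missing value and a tie for the most frequent value.

```python
import pandas as pd

# Made-up scores: one missing value, and 20 and 30 each appear twice
scores = pd.Series([10, 20, 20, 30, 30, None], name="Score")

mean_skip = scores.mean()            # missing value ignored: 110 / 5
mean_zero = scores.fillna(0).mean()  # missing value counted as 0: 110 / 6
median = scores.median()
modes = scores.mode().tolist()       # every tied value is returned, sorted

print(f"Mean ignoring NaN: {mean_skip}")
print(f"Mean filling NaN with 0: {mean_zero:.2f}")
print(f"Median: {median}")
print(f"Modes: {modes}")
```

Whether a missing value should count as zero depends on what it means in your data, so it is worth deciding this explicitly rather than relying on a default.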
23. Creating New Sheets
Adding new sheets programmatically in an Excel workbook can be useful for segmenting data or logging data over time. Python’s openpyxl library provides the create_sheet() method for this purpose.
To start, import openpyxl and load your workbook:
from openpyxl import Workbook, load_workbook
try:
    workbook = load_workbook(filename="your-file.xlsx")
except FileNotFoundError:
    # Start a new workbook if the file does not exist yet
    workbook = Workbook()
To add a new sheet:
worksheet_summary = workbook.create_sheet(title="Summary")
workbook.save(filename="your-file.xlsx")
You can specify the position of the new sheet:
worksheet_first = workbook.create_sheet(title="First Sheet", index=0)
workbook.save(filename="your-file.xlsx")
Populating New Sheets with Data
To add data to the new sheet:
worksheet_summary = workbook["Summary"]
worksheet_summary["A1"] = "Category"
worksheet_summary["B1"] = "Total Sales"
worksheet_summary.append(["Electronics", 15000])
worksheet_summary.append(["Books", 7500])
worksheet_summary.append(["Clothing", 12000])
workbook.save(filename="your-file.xlsx")
Customizing New Sheets
To style the new sheet:
from openpyxl.styles import Font
bold_font = Font(bold=True)
worksheet_summary["A1"].font = bold_font
worksheet_summary["B1"].font = bold_font
worksheet_summary.column_dimensions['A'].width = 20
workbook.save(filename="your-file.xlsx")
Creating Multiple Sheets Based on Data
To create sheets dynamically based on a DataFrame:
import pandas as pd
df = pd.DataFrame({
    'Category': ['Electronics', 'Books', 'Clothing'],
    'Total Sales': [15000, 7500, 12000]
})
# Create one sheet per category, each with a header row and one data row
for index, row in df.iterrows():
    sheet_name = row['Category']
    worksheet = workbook.create_sheet(title=sheet_name)
    worksheet.append(['Category', 'Total Sales'])
    worksheet.append([row['Category'], row['Total Sales']])
workbook.save(filename="your-file.xlsx")
This feature allows for efficient management of Excel workbooks, enhancing organization and data structure.
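When sheet names come from data, as in the loop above, they may not be valid Excel sheet names: Excel limits names to 31 characters and does not allow the characters \ / ? * [ ] and :. The helper below is a hypothetical sketch (safe_sheet_title is not part of openpyxl) that cleans a value before it is passed to create_sheet().

```python
import re

# Characters Excel does not allow in sheet names
INVALID_TITLE_CHARS = re.compile(r"[\\/*?:\[\]]")
MAX_TITLE_LENGTH = 31


def safe_sheet_title(raw_name, existing_titles=()):
    """Return a sheet title that fits Excel's naming rules (illustrative helper)."""
    # Replace forbidden characters and fall back to a default if nothing is left
    title = INVALID_TITLE_CHARS.sub("_", str(raw_name)).strip() or "Sheet"
    title = title[:MAX_TITLE_LENGTH]

    # Sheet names are compared case-insensitively, so add a suffix on a clash
    taken = {existing.lower() for existing in existing_titles}
    candidate = title
    counter = 1
    while candidate.lower() in taken:
        suffix = f"_{counter}"
        candidate = title[:MAX_TITLE_LENGTH - len(suffix)] + suffix
        counter += 1
    return candidate


print(safe_sheet_title("Books/Media: 2024"))
print(safe_sheet_title("Summary", existing_titles=["summary"]))
```

In the loop above, this would be used as workbook.create_sheet(title=safe_sheet_title(row['Category'], workbook.sheetnames)).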
24. Extracting Data Ranges
Extracting specific data ranges can improve analysis efficiency. Python’s openpyxl and pandas libraries provide methods for working with data ranges.
Using openpyxl
To extract a range using openpyxl:
from openpyxl import load_workbook
workbook = load_workbook(filename="your-file.xlsx")
sheet = workbook["Sheet1"]
data_range = sheet["A1:C10"]
for row in data_range:
    for cell in row:
        print(cell.value, end=" ")
    print()
Using pandas
To extract a range using pandas:
import pandas as pd
df = pd.read_excel("your-file.xlsx", sheet_name="Sheet1")
data_range = df.iloc[0:10, 0:3]
print(data_range)
Dynamic Range Specification
To extract data based on conditions:
conditional_range = df[df['Sales'] > 500]
print(conditional_range)
Range Selection Based on Headers
To select ranges using column names:
header_range = df.loc[0:9, ['Category', 'Region', 'Sales']]
print(header_range)
Combining Row and Column Conditions
For more complex data operations:
combined_range = df.loc[df['Region'] == 'West', ['Product', 'Sales']]
print(combined_range)
Saving Extracted Ranges
To save the extracted data:
combined_range.to_excel("focused_data.xlsx", index=False)
Applying Functions to Data Ranges
To perform calculations on extracted data:
total_sales = combined_range['Sales'].sum()
print(f"Total Sales: {total_sales}")
These techniques allow for precise and efficient data manipulation, enhancing productivity and streamlining workflows.
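One detail that is easy to miss in the snippets above is that iloc and loc treat the end of a slice differently: iloc[0:10] stops before position 10, while loc[0:9] includes the row labelled 9. A minimal sketch on made-up data:

```python
import pandas as pd

# Made-up sales data with the default 0..3 row labels
df = pd.DataFrame({
    "Region": ["West", "East", "West", "North"],
    "Product": ["Laptop", "Desk", "Phone", "Chair"],
    "Sales": [600, 450, 300, 900],
})

# Position-based: end is exclusive, so this returns rows 0 and 1
by_position = df.iloc[0:2, 0:2]

# Label-based: end is inclusive, so this returns rows 0, 1 and 2
by_label = df.loc[0:2, ["Region", "Sales"]]

# Boolean mask for rows combined with a column list
west_sales = df.loc[df["Region"] == "West", ["Product", "Sales"]]
total_west = west_sales["Sales"].sum()

print(by_position.shape, by_label.shape, total_west)
```

This is why the earlier examples use 0:10 with iloc but 0:9 with loc to select the same ten rows.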
25. Dynamic Column Names
Dynamic column names are useful when working with changing datasets or aligning column names with specific requirements. Python’s pandas library provides methods for renaming columns flexibly.
To rename columns, use the rename() method:
import pandas as pd
# Load dataset
df = pd.read_excel("your-file.xlsx")
# Define renaming dictionary
columns_rename_map = {
    "OldColumnName1": "NewColumnName1",
    "OldColumnName2": "NewColumnName2"
}
# Rename columns
df.rename(columns=columns_rename_map, inplace=True)
For pattern-based renaming:
# Add prefix to all column names
df.columns = ["Prefix_" + col for col in df.columns]
# Use regex to replace parts of column names
df.columns = df.columns.str.replace('Old', 'New', regex=True)
To rename based on external mappings:
# Load column mapping from CSV
column_mappings = pd.read_csv("column_mappings.csv")
columns_rename_map = dict(zip(column_mappings['OldName'], column_mappings['NewName']))
df.rename(columns=columns_rename_map, inplace=True)
For conditional renaming, apply a function:
def transform_column_name(col_name):
    return col_name.replace("Old", "New") if "Old" in col_name else col_name

df.columns = [transform_column_name(col) for col in df.columns]
To read column structures from configuration files:
import json
with open("column_config.json", "r") as file:
    columns_rename_map = json.load(file)
df.rename(columns=columns_rename_map, inplace=True)
For MultiIndex DataFrames:
# Create MultiIndex DataFrame
arrays = [["A", "A", "B", "B"], ["one", "two", "one", "two"]]
index = pd.MultiIndex.from_arrays(arrays, names=['upper', 'lower'])
df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], columns=index)
# Rename levels
df = df.rename(columns={"A": "Alpha", "B": "Beta"}, level=0)
These techniques help maintain data organization and consistency, especially in dynamic data environments.
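A common use of function-based renaming is normalizing headers that arrive with inconsistent spacing, case, or symbols. The sketch below passes a function directly to rename(); the normalize_header name and the sample headers are made up for illustration.

```python
import re

import pandas as pd

# Made-up DataFrame with messy headers
df = pd.DataFrame({" Order ID": [1], "Total Sales ($)": [9.5], "region": ["West"]})


def normalize_header(col_name):
    """Lowercase a header and collapse non-alphanumeric runs into underscores."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", col_name.strip()).strip("_")
    return cleaned.lower()


# rename() accepts a function and applies it to every column name
df = df.rename(columns=normalize_header)

print(list(df.columns))
```

Because the function is applied to whatever columns are present, the same code keeps working when columns are added or reordered in the source file.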
Using these Python tools can streamline Excel tasks and improve data management efficiency. Whether you are automating processes or extracting specific data ranges, these methods provide a structured approach to handling spreadsheets.
Key Excel Functions for Data Analysis
- SUM: Totals a range of cell values
- AVERAGE: Calculates the mean of selected cells
- COUNT: Counts cells containing numbers in a range
- VLOOKUP: Searches for a value in the leftmost column of a table and returns a corresponding value
- CONCATENATE: Joins multiple text strings into one
Advanced data manipulation techniques in Python, such as pivot tables and merging dataframes, can replicate and enhance many Excel functionalities:
# Creating a pivot table
pivot_df = df.pivot_table(index='Category', values='Sales', aggfunc='sum')
# Merging dataframes
merged_df = pd.merge(df1, df2, on='ID')
By combining Python’s powerful data analysis libraries with Excel’s familiar interface, analysts can create more robust and automated data processing workflows.
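As a rough guide, the Excel functions listed above map onto pandas operations as sketched below. The DataFrames and column names are made up for illustration, and a left merge is only an approximate counterpart to VLOOKUP.

```python
import pandas as pd

# Made-up order data with one missing sales figure
orders = pd.DataFrame({
    "ID": [1, 2, 3],
    "Sales": [100.0, None, 250.0],
    "First": ["Ann", "Bob", "Cy"],
    "Last": ["Lee", "Ray", "Poe"],
})
# Made-up lookup table; ID 2 is deliberately absent
regions = pd.DataFrame({"ID": [1, 3], "Region": ["West", "East"]})

total = orders["Sales"].sum()     # SUM: missing values are skipped
average = orders["Sales"].mean()  # AVERAGE: missing values are skipped
count = orders["Sales"].count()   # COUNT: counts non-missing values only

# VLOOKUP-style lookup: keep every order, fill Region where the ID matches
looked_up = orders.merge(regions, on="ID", how="left")

# CONCATENATE: join text columns with a separator
orders["Full Name"] = orders["First"] + " " + orders["Last"]

print(total, average, count)
print(looked_up[["ID", "Region"]])
```

Unlike VLOOKUP, a merge returns every matching row, so duplicate IDs in the lookup table will produce duplicate rows in the result.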