Discussion

Apache POI – Parsing Excel Workbook

Requirement :

The requirement was to parse an excel file which is in tabular format but number of columns, column names can be different for each excel, i.e. an unstructured excel file.

Since API activities like pxParseExcelFile , MSOParseExcel works with only excel files with a predefined structure, we needed a custom solution to fulfill our requirement. We have used Apache POI api to implement this.

Notes :

Apache POI supports both XLS and XLSX type of files.

API Docs for Apache POI:

https://poi.apache.org/apidocs/dev/

  • The solution used only XSSFWorkbook class to parse XLSX file but the same code can be extended with HSSFWorkbook to work with XLS files too.
  • The Java code has been attached in the post, and that can be reused as a function.

Here are some documentation about the java code.

  • To read an XLSX file :
  1. PRInputStream myxls = new PRInputStream(FileName);
  2. org.apache.poi.xssf.usermodel.XSSFWorkbook wb = new org.apache.poi.xssf.usermodel.XSSFWorkbook(myxls);

Filename can be set from pxRequestor.pyFileUpload (pzFilePathWithForm control updates this property with file location string like : "file://web:/StaticContent/global/ServiceExport/filename.xlsx" )

  • To get the number of sheets in a workbook
 int max_sheets = wb.getNumberOfSheets();
  • To get the name of current sheet
String currSheetname = wb.getSheetName(sheetnum) ;   
  • To access a specific sheet:
org.apache.poi.xssf.usermodel.XSSFSheet sheet = wb.getSheetAt(sheetnum);
  • To get a specific row :
org.apache.poi.xssf.usermodel.XSSFRow row = sheet.getRow(j);
  • To get data in a specific cell:
org.apache.poi.xssf.usermodel.XSSFCell cellData = row.getCell(k); -- K represents the number of the cell 

Output :

The java function will parse the excel and store the data in a pagelist inside a pagelist structure.

We are storing column level information in a pagelist because the number of columns is not predefined.

tempParsedData.pxResults() -- will contain data for each row and .pxResults(1) will have column names

-->.SingleValueProp() -- a pagelist property for each .pxResults() which hold information cellwise i.e. column wise.

Comments

Keep up to date on this post and subscribe to comments