Importing Excel Files
DecaSIM can be configured to import and process Excel formatted files.
Types Of Files
Excel imports can be processed to one of two File Types
; LAYOUT
files or OFFER
files. This may change as we roll-out more features. LAYOUT
files are saved in the customer DESIGN FILES
view. OFFER
files are saved in the customer PROMOTION FILES
view.
DecaSIM takes user-defined configuration files to process worksheets
within an Excel workbook.
Each worksheet a user wants to import and process has its own SHEET SPEC
.
SHEET SPEC
A SHEET SPEC
comprises the following:
TARGET SHEET
: The name of the worksheet in the workbook that is going to be processed and imported.HEADER ROW
: The row number where processor will find the header row. This uses zero-based counting so the first row is row0
Cell Config File
: The name of the file containing the process configuration parameters.File Type
: The type of file that will be saved in DecaSIM;LAYOUT
orOFFER
file.Sheet Pattern Match
: Users can also specify a regex string pattern to identify the worksheet to import. This is useful when there is important information contained in the worksheet name that the user wishes to parse.
BASIC CELL CONFIG FILE
The Cell Config File
contains two important sets of instructions. The first is a set of instructions that tells DecaSIM the broad layout of the worksheet. The second is a set of instructions that tells DecaSIM what to do when it encounters specific features and data.
There are two types of worksheet, a DEFAULT
type where each row is a unique row of data and other types where clusters of rows represent an entity and cells within the cluster contain data that describe the entity.
The worksheet import configuration uses the follow to specify which type of layout is being processed:
DEFAULT
;SINGLE_ROW
;RowCheck.CheckCondition
: This configuraton tells the processor to expect that each row represents a single row of data.CONDITION
;TopBorder.Thin
;RowCheck.CheckCondition
: This configuration tells the processor to use aCONDITION
to check if a new row has been identified. If this configuration is used then the config file must also specify where to check this condition. For these files we use the configuration parameterRowCheck.Location
to specify which column the processor to check for theCONDITION
. In this example were are checking columnM
which in zero-based counting is poisition12
so our config file must include a statement like this:12
;0
;RowCheck.Location
.
Irrespective of the worksheet layout, the config must also constrain the columns that the processor includes as it reads each row. The statement 34
;0
;MAX_COLUMN
.Location
is used to specify that the process will include data from column 0 to column 34 (AI
) when it produces the output file.
As well as clusters of rows representing an entity, the processor can also search for clusters of columns that represent sub-entities within each entity. A set of config statements can trigger this behavior.
Element | Value | Type |
---|---|---|
BORDER_RIGHT*THIN ; | OR ; | FeatureGridSearch .PAGE_SPLIT_TYPE .1 |
1 ; | 9 ; | FeatureSearchRange .BORDER_RIGHT*THIN |
COLUMN ; | 0 ; | FeatureSearchType .BORDER_RIGHT*THIN |
However, since the worksheet formatting might not be consistent - sometimes Excel will put a right border on the lefthand side of a cell or group of cells OR a left hand border on the righthand side. We need to capture that in our search configuration.
Element | Value | Type |
---|---|---|
BORDER_LEFT*THIN ; | OR ; | FeatureGridSearch .PAGE_SPLIT_TYPE .2 |
2; | 10; | FeatureSearchRange .BORDER_LEFT*THIN |
COLUMN ; | -1; | FeatureSearchType .BORDER_LEFT*THIN |
Detailed Description
These two tables define a FeatureGridSearch
to locate the PAGE_SPLIT_TYPE
. There are two parts to this search as we need to search for left and right borders within the search range. To ensure the FeatureGridSearch
uses both checks we give each one a different id: ...PAGE_SPLIT_TYPE
.1
and ...PAGE_SPLIT_TYPE
.2
When this FeatureGridSearch
finds a feature, in this case a THIN
border on either the left handside or right hand side of a cell the search returns a column index of the column containing that feature. To avoid double counting we specify the offset for COLUMN
value that is returned. For example if column B (index = 1) we find a righthand border the FeatureGridSearch
returns the value 1. However if the user has added the border feature on the lefthand of column C (index = 2) then the FeatureGridSearch
would return the value 2. This would lead to double counting for the field specified PAGE_SPLIT_TYPE
. To avoid that add the value -1 to the returned column when we search for BORDER_LEFT*THIN
.
Having identified a feature using these conditions, users can label the features that are discovered using the FeatureLabel
configuration.
FEATURE_COUNT.FEATURE INDEX; Label; FeatureLabel
.FIELDNAME.RULE INDEX
Example 1:
Element | Value | Type | Notes |
---|---|---|---|
1.1; | DOUBLE_PAGE; | FeatureLabel.PAGE_SPLIT_TYPE.1; | ** KEEP ** |
This configuration will apply the label DOUBLE_PAGE_ to the first feature discovered when only of these features is found in the ENTITY
Example 2:
Element | Value | Type | Notes |
---|---|---|---|
2.1; | LEFT_HAND_SECTION; | FeatureLabel.PAGE_SPLIT_TYPE.2; | ** KEEP ** |
2.2; | RIGHT_HAND_SECTION; | FeatureLabel.PAGE_SPLIT_TYPE.3; | ** KEEP ** |
In this example the configuration will apply one of two labels (LEFT_HAND_SECTION or RIGHT_HAND_SECTION) depending on whether or not this is the first, 2.1
, or second, 2.2
features discovered.
Example 3:
Element | Value | Type | Notes |
---|---|---|---|
1.1; | DOUBLE_PAGE; | FeatureLabel.PAGE_SPLIT_TYPE.1; | ** KEEP ** |
2.1; | LEFT_HAND_SECTION; | FeatureLabel.PAGE_SPLIT_TYPE.2; | ** KEEP ** |
2.2; | RIGHT_HAND_SECTION; | FeatureLabel.PAGE_SPLIT_TYPE.3; | ** KEEP ** |
3.1; | ROBA; | FeatureLabel.PAGE_SPLIT_TYPE.4; | ** KEEP ** |
3.2; | LEFT_HAND_SECTION; | FeatureLabel.PAGE_SPLIT_TYPE.5; | ** KEEP ** |
3.3; | RIGHT_HAND_SECTION; | FeatureLabel.PAGE_SPLIT_TYPE.6; | ** KEEP ** |
4.1; | LEFT_HAND_LEFT_ROBA; | FeatureLabel.PAGE_SPLIT_TYPE.7; | ** KEEP ** |
4.2; | LEFT_HAND_RIGHT_ROBA; | FeatureLabel.PAGE_SPLIT_TYPE.8; | ** KEEP ** |
4.3; | RIGHT_HAND_LEFT_ROBA; | FeatureLabel.PAGE_SPLIT_TYPE.9; | ** KEEP ** |
4.4; | RIGHT_HAND_RIGHT_ROBA; | FeatureLabel.PAGE_SPLIT_TYPE.10; | ** KEEP ** |
Search Types
There are some simple feature-based searches the configuration can specify:
FeatureSearch
:FeatureGridSearch
: Searches a cluster of rows in a grid fashion (top to bottom then left to right) looking for a feature.FeatureLabelSearch
: Extracts content based on Features within specific parts of the excel that have been labeled already.
What Is A Feature
A feature is combination of feature and position within a cluster of columns. The way the data is constructed is hierarchical, configurations can be specified to look for:
- formats,
- formats + row position,
- formats + row position + column position
Where specified the row position is always relative to the start of the cluster of rows that represents an entity.
FeatureLabel Search
In the above example we used the configuration to perform a FeatureGrid
search to identify different sections of the Excel document and classify them into a field called PAGE_SPLIT_TYPE. We can now perform a FeatureLabel
search and identify features within these labelled areas of the Excel sheet.
To specify a FeatureLabel
search we add a line to the config file:
Feature To Search For
; Search Type
; FeatureLabelSearch.Label
.FieldName
Example 1
For example we add the following config line:
- FONT_COLOR*FFFF00FF;OR;FeatureLabelSearch.PAGE_SPLIT_TYPE.ESPECIFICATIONS
This gives the following instructions to DecaSIM
:
- Search For Feature => FONT_COLOR*FFFF00FF
- Search Type: OR => only search for this feature in each cell.
- Search within areas of the Excel Structure that have been labelled with a PAGE_SPLIT_TYPE field.
- Append content to a new field called ESPECIFICATIONS
Example 2
We add two config lines to the file:
- FONT_COLOR*FF000000;AND;FeatureLabelSearch.PAGE_SPLIT_TYPE.CATEGORY
- BACKGROUND_COLOR*FFFFFFFF;AND;FeatureLabelSearch.PAGE_SPLIT_TYPE.CATEGORY
This configuraton is using the Search Type
keyword AND. This tells DecaSIM
to evaluate all config rules during the search. Each must be true for a feature to be extracted and assigned to the new field, CATEGORY.
Feature Search
Example 1
To search for a feature in a specific column
These config lines tell DecaSIM
to search for black font FF000000
in Column 14 (N
Column in Excel) in the First, 2nd, 3rd...Last row of the section in Excel. The results of the search are then added to a column called CATEGORY.
FONT_COLOR*FF000000*FIRST*14
;OR;FeatureSearch.CATEGORY.1FONT_COLOR*FF000000*2*14
;OR;FeatureSearch.CATEGORY.2FONT_COLOR*FF000000*3*14
;OR;FeatureSearch.CATEGORY.3FONT_COLOR*FF000000*4*14
;OR;FeatureSearch.CATEGORY.4FONT_COLOR*FF000000*5*14
;OR;FeatureSearch.CATEGORY.5FONT_COLOR*FF000000*6*14
;OR;FeatureSearch.CATEGORY.6FONT_COLOR*FF000000*7*14
;OR;FeatureSearch.CATEGORY.7FONT_COLOR*FF000000*LAST*14
;OR;FeatureSearch.CATEGORY.8
Example 2
In this example we are extracting content based on features and constructing an array of data. For each member of the OR Array a new row is created with an extra column labelled COUNTRY_CODE. This column contains the field Header Field = Content
. In the example below there are 11 elements in the OR_ARRAY, this config will generate 11 rows of data - each row contain the unique content from this OR_ARRAY.
FONT_COLOR*FF000000*FIRST*15;OR_ARRAY;FeatureSearch.COUNTRY_CODE.1
FONT_COLOR*FF000000*FIRST*17;OR_ARRAY;FeatureSearch.COUNTRY_CODE.2
FONT_COLOR*FF000000*FIRST*19;OR_ARRAY;FeatureSearch.COUNTRY_CODE.3
FONT_COLOR*FF000000*FIRST*21;OR_ARRAY;FeatureSearch.COUNTRY_CODE.4
FONT_COLOR*FF000000*FIRST*23;OR_ARRAY;FeatureSearch.COUNTRY_CODE.5
FONT_COLOR*FF000000*FIRST*25;OR_ARRAY;FeatureSearch.COUNTRY_CODE.6
FONT_COLOR*FF000000*FIRST*27;OR_ARRAY;FeatureSearch.COUNTRY_CODE.7
FONT_COLOR*FF000000*FIRST*29;OR_ARRAY;FeatureSearch.COUNTRY_CODE.8
FONT_COLOR*FF000000*FIRST*31;OR_ARRAY;FeatureSearch.COUNTRY_CODE.9
FONT_COLOR*FF000000*FIRST*33;OR_ARRAY;FeatureSearch.COUNTRY_CODE.10
FONT_COLOR*FF000000*FIRST*35;OR_ARRAY;FeatureSearch.COUNTRY_CODE.11