Profiling and Approving a New Data Load


Figure 3: New Data Load Cluster Profiling window

Primary Key Selection

The primary key column is a very important column in a Dataset. It should contain a unique value for each individual row, in a format that is consistent across every row. An example would be a simple integer serial or “count” key: (1,2,3,4,5,…n). If none of the columns here seem suitable as a primary key, you may wish to go back to InFlow and use the Primary Key transform to add a new Primary Key column to your dataset, then run the InFlow again to update the data load.
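Before picking a column, it can help to check the source file yourself. The following is a minimal sketch in plain Python, not part of Inzata, that tests whether every value in a column is non-empty and unique. The function name `is_suitable_primary_key` and the sample data are hypothetical, and the sketch does not check format consistency.

```python
import csv
import io


def is_suitable_primary_key(rows, column):
    """Return True if every value in `column` is non-empty and unique."""
    seen = set()
    for row in rows:
        value = (row.get(column) or "").strip()
        if not value or value in seen:
            return False
        seen.add(value)
    return True


# Hypothetical sample data standing in for a source CSV file.
sample = io.StringIO("order_id,customer\n1,Ann\n2,Bob\n3,Ann\n")
rows = list(csv.DictReader(sample))

print(is_suitable_primary_key(rows, "order_id"))  # True
print(is_suitable_primary_key(rows, "customer"))  # False
```

Here `customer` fails because the value “Ann” appears twice, while `order_id` has one distinct value per row.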

Below the Primary Key, you can set the following options for each column in the data load.

Inzata reads the column headers and records them in the Column Name field. These cannot be modified. Inzata’s AI inspects the contents of each column and detects its most likely column type (attribute, fact, date, or ident). You can change the detected type if you wish.

Label – Inzata gives you the ability to create a custom label or alias for each column. The original column name is recorded, and the new label is how that column will be identified inside of Inzata. A great use for labels is to give columns a friendlier name that users will recognize more easily. The value defaults to the original Column Name. Double quotes and apostrophes cannot be used in object names. Column labels can contain spaces.

Tips for using labels:

When a column is the same within different source files, e.g. “Customer” in both the “Order” and “Payment” clusters, you can use the same name in all source files. Then the same attribute will be generated in a project, and it will be part of several clusters.

On the other hand, when a column appears in multiple source files but will be used to join two clusters (using the join function under the LDM tab in the top left), use different names in the associated source files.

id – the identifier for a column of data or metadata object (mandatory value). An ID must be unique within an Inzata project. The id is automatically derived from a column’s name by replacing invalid characters with an underscore character. You can modify it in the edit box where the name is listed in Figure 3.

An identifier can be at most 255 characters long. Only the characters “A-Z”, “a-z”, “0-9”, and “_” are allowed. ID values are copied from the header of a CSV file, and you can modify them in this edit box. We recommend using the prefix “a_” for attributes and “f_” for facts; please see Inzata Standard prefix (check box).
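The id rules described here can be sketched in a few lines of Python. This is an illustration only, not Inzata’s actual implementation: the function `derive_id` and its parameters are hypothetical, and truncating to 255 characters is an assumption about how an over-long name might be handled.

```python
import re

MAX_ID_LENGTH = 255  # maximum identifier length stated in the text


def derive_id(column_name, column_type=None, use_standard_prefix=False):
    """Sketch of deriving an id from a column name (hypothetical helper)."""
    # Replace every character outside A-Z, a-z, 0-9 and _ with an underscore.
    identifier = re.sub(r"[^A-Za-z0-9_]", "_", column_name)
    if use_standard_prefix:
        # "a_" for attributes, "f_" for facts; no prefix for other types.
        prefix = {"attribute": "a_", "fact": "f_"}.get(column_type, "")
        identifier = prefix + identifier
    # Assumption: names longer than the limit are simply truncated.
    return identifier[:MAX_ID_LENGTH]


print(derive_id("Customer Name"))  # Customer_Name
print(derive_id("Order Total ($)", "fact", use_standard_prefix=True))  # f_Order_Total____
```

Each invalid character is replaced one for one, so a run of punctuation becomes a run of underscores. Whether Inzata collapses such runs is not stated in this article.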

As with labels, when a column is the same within different source files, e.g. “Customer” in both the “Order” and “Payment” clusters, use the same id in all source files. Then the same attribute will be generated in a project and it will be part of several clusters.

On the other hand, when a column appears in multiple source files but will be used to join two clusters (using the join function), use different ids in the associated source files.

Column type – there are 4 possible values: Attribute, Fact, Skip (the column is not processed), and Label (please see Rules for CSV file).

Attribute:

When Attribute is selected, the Name and id items (described above) are used as the Attribute Name and the Attribute id. There are 5 possible data types for an attribute column, which are detailed below. If the “String” data type is selected together with the Attribute column type, the column from the data layout is defined as a label with the default name “Text”. You can rename this label and change its id by clicking the Attribute Name, which opens the Right Properties Panel for this column option.

Note: This column is used as the primary key for the attribute definition. This means that a unique id is generated for this attribute for each unique value in the column.

- Data type – there are the following possible data types for each Attribute column:
  - String
  - Date
  - Date Time
  - Time
  - Ident

Ident is a special type defined as a numeric one. The value is directly used as an identifier of an attribute and thus identifiers are not generated in a project. These attributes have no labels (e.g. codes, descriptions).

When any other data type is selected, the identifier values are generated in Inzata format and the CSV values are stored as a TEXT label of the attribute.
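The difference between Ident and the other data types can be pictured with a small Python sketch. It is purely illustrative: the function `build_attribute_members` is hypothetical, and the sequential integers stand in for ids in “Inzata format”, which this article does not specify.

```python
def build_attribute_members(values, data_type):
    """Map attribute member ids to their text labels (hypothetical helper)."""
    if data_type == "ident":
        # The numeric value itself is the identifier; no label is stored.
        return {int(value): None for value in values}
    # Other data types: generate one id per distinct value and keep the
    # original CSV text as the label. Sequential ids are an assumption.
    ids_by_label = {}
    for value in values:
        ids_by_label.setdefault(value, len(ids_by_label) + 1)
    return {member_id: label for label, member_id in ids_by_label.items()}


print(build_attribute_members(["US", "DE", "US"], "string"))   # {1: 'US', 2: 'DE'}
print(build_attribute_members(["101", "102", "101"], "ident"))  # {101: None, 102: None}
```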

For the Date and Date Time data types, the format of the source data is displayed on the next row.

Label:

When Label is selected, the Name and id items (described above) are used as the Label Name and the Label id. You also have to assign the attribute that the label belongs to. Select it from the Attribute Name list, which is displayed below the Column Type. The previous attribute is pre-selected.

Fact:

The data type for a fact is always Numeric. Facts have an additional setting for the number of decimal places. This is the number format used in the Inzata environment. For instance, if you select the “0.00” numeric format, the number 1.1234 will be loaded as 1.12. Use the “0” selection for integer numbers.
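The effect of the numeric format can be reproduced with Python’s `decimal` module. This is a sketch under an assumption: the article does not say how Inzata rounds, so half-up rounding is used here, and the function name `apply_numeric_format` is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP


def apply_numeric_format(value, numeric_format):
    """Round `value` to the decimal places implied by a format like "0.00"."""
    # "0" has no fractional part; "0.00" has two decimal places.
    decimals = len(numeric_format.split(".")[1]) if "." in numeric_format else 0
    quantum = Decimal(1).scaleb(-decimals)  # e.g. Decimal("0.01") for two places
    # Assumption: half-up rounding; the real rounding rule is not documented here.
    return Decimal(str(value)).quantize(quantum, rounding=ROUND_HALF_UP)


print(apply_numeric_format(1.1234, "0.00"))  # 1.12
print(apply_numeric_format(1.1234, "0"))     # 1
```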

Skip:

The final column type is Skip. It tells InModeler to disregard that column when uploading to a project. The AI does not select this type when a new file is uploaded, so on a new upload it is usually applied only when the user selects it manually.

To set more detailed parameters, click “show advanced options”, as shown in Figure 4:

Figure 4: New Data Load Advanced Options

Inzata Standard prefix (check box)

The standard identifiers of attributes are strings with the prefix “a_”; for facts, the prefix is “f_”. The identifier of an object has to be unique within an Inzata project. These prefixes help keep the metadata in the Inzata project consistent. When this box is checked, the prefixes are automatically added to the ID strings.

File Encoding

Use this option to set the encoding of your source text file.

Create _____ metrics from ___ checkboxes

Inzata can autogenerate commonly used aggregation metric formulas from facts and attributes. For facts, available metrics are: SUM, AVG, MIN and MAX. For Attributes, it can generate COUNT metrics. Once generated, these will appear automatically in the Metrics menu in InBoard.
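As a rough illustration of which metrics result from these checkboxes, the sketch below lists the aggregations per column type in Python. The function `generate_metrics`, the column ids, and the `AGG(column)` display format are hypothetical. Only the aggregation names come from the text above.

```python
def generate_metrics(columns):
    """Return (aggregation, column id) pairs for a list of (id, type) columns."""
    fact_aggregations = ["SUM", "AVG", "MIN", "MAX"]
    metrics = []
    for column_id, column_type in columns:
        if column_type == "fact":
            metrics.extend((agg, column_id) for agg in fact_aggregations)
        elif column_type == "attribute":
            metrics.append(("COUNT", column_id))
        # Other column types (for example, skipped columns) produce no metrics.
    return metrics


# Hypothetical columns from a data load.
columns = [("f_amount", "fact"), ("a_customer", "attribute"), ("notes", "skip")]
for aggregation, column_id in generate_metrics(columns):
    print(f"{aggregation}({column_id})")
```

For the sample columns this yields four metrics for the fact and one COUNT metric for the attribute.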

After setting the parameters, press the “Complete Load“ button.
