v If one or more key variables are used to match cases and you indicate that the files are already sorted,
the two datasets must be sorted by ascending order of the key variable(s).
v Variable names in the second data file that duplicate variable names in the active dataset are excluded
by default because Add Variables assumes that these variables contain duplicate information.
Indicate case source as variable. Indicates the source data file for each case. This variable has a value of 0 for
cases from the active dataset and a value of 1 for cases from the external data file.
Excluded Variables. Variables to be excluded from the new, merged data file. By default, this list contains
any variable names from the other dataset that duplicate variable names in the active dataset. Variables
from the active dataset are identified with an asterisk (*). Variables from the other dataset are identified
with a plus sign (+). If you want to include an excluded variable with a duplicate name in the merged
file, you can rename it and add it to the list of variables to be included.
New Active Dataset. Variables to be included in the new, merged dataset. By default, all unique variable
names in both datasets are included on the list.
Key Variables. You can use key variables to correctly match cases in the two files. For example, there
may be an ID variable that identifies each case.
v If one of the files is a table lookup file, you must use key variables to match cases in the two files. Key
values must be unique in table lookup files. If there are multiple keys, the combination of key values
must be unique.
v The key variables must have the same names in both datasets. Use Rename to change the key variable
names if they are not the same.
Non-active or active dataset is keyed table. A keyed table, or table lookup file, is a file in which data
for each "case" can be applied to multiple cases in the other data file. For example, if one file contains
information on individual family members (such as sex, age, education) and the other file contains
overall family information (such as total income, family size, location), you can use the file of family data
as a table lookup file and apply the common family data to each individual family member in the
other file.
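The table lookup idea can be illustrated outside the dialog box with a minimal Python sketch. It is not how Add Variables is implemented; the data, the variable names (family_id, income, size), and the helper table_lookup_merge are all hypothetical and exist only to show one family row being applied to several member rows.

```python
# Hypothetical data: one row per family member (case file)
# and one row per family (keyed table / table lookup file).
members = [
    {"family_id": 1, "name": "Ana", "age": 41},
    {"family_id": 1, "name": "Ben", "age": 12},
    {"family_id": 2, "name": "Cy", "age": 67},
]
families = [
    {"family_id": 1, "income": 52000, "size": 2},
    {"family_id": 2, "income": 31000, "size": 1},
]


def table_lookup_merge(cases, table, key):
    """Attach the matching table row's variables to every case with the same key."""
    lookup = {}
    for row in table:
        # Key values must be unique in a table lookup file.
        if row[key] in lookup:
            raise ValueError(f"duplicate key value in lookup table: {row[key]!r}")
        lookup[row[key]] = row
    merged = []
    for case in cases:
        extra = lookup.get(case[key], {})
        # On a name clash the case file's value is kept (an assumption of this sketch).
        merged.append({**extra, **case})
    return merged


merged = table_lookup_merge(members, families, "family_id")
```

Both members of family 1 end up with the same income and size values, which is the behavior the family example describes.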
1. Open at least one of the data files that you want to merge. If you have multiple datasets open, make
one of the datasets that you want to merge the active dataset.
2. From the menus choose:
Data > Merge Files > Add Variables...
3. Select the dataset or external IBM SPSS Statistics data file to merge with the active dataset.
To Select Key Variables
1. Select the variables from the external file variables (+) on the Excluded Variables list.
2. Select Match cases on key variables in sorted files.
3. Add the variables to the Key Variables list.
The key variables must exist in both the active dataset and the other dataset.
Add Variables: Rename
You can rename variables from either the active dataset or the other data file before moving them to the
list of variables to be included in the merged data file. This is primarily useful if you want to include two
variables with the same name that contain different information in the two files or if a key variable has
different names in the two files.
Merging More Than Two Data Sources
Using command syntax, you can merge more than two data files.
v Use MATCH FILES to merge multiple files that don't contain key variables or multiple files already sorted
on key variable values.
v Use STAR JOIN to merge multiple files where there is one case data file and multiple table lookup files.
Files do not need to be sorted in order of key variable values, and each table lookup file can use a
different key variable.
Aggregate Data aggregates groups of cases in the active dataset into single cases and creates a new,
aggregated file or creates new variables in the active dataset that contain aggregated data. Cases are
aggregated based on the value of zero or more break (grouping) variables. If no break variables are
specified, then the entire dataset is a single break group.
v If you create a new, aggregated data file, the new data file contains one case for each group defined
by the break variables. For example, if there is one break variable with two values, the new data file
will contain only two cases. If no break variable is specified, the new data file will contain one case.
v If you add aggregate variables to the active dataset, the data file itself is not aggregated. Each case
with the same value(s) of the break variable(s) receives the same values for the new aggregate
variables. For example, if gender is the only break variable, all males would receive the same value for
a new aggregate variable that represents average age. If no break variable is specified, all cases would
receive the same value for a new aggregate variable that represents average age.
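The difference between the two outcomes can be sketched in plain Python. This is an illustration under stated assumptions, not the Aggregate Data procedure itself: the data, the variable names (gender, age, age_mean), and the helpers group_means, aggregate_to_new_file, and add_aggregate_variable are hypothetical, and the mean is the only aggregate function shown.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical cases with one break variable (gender) and one source variable (age).
cases = [
    {"gender": "m", "age": 30},
    {"gender": "m", "age": 40},
    {"gender": "f", "age": 25},
]


def group_means(rows, break_vars, source):
    """Mean of `source` for each unique combination of break variable values."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[b] for b in break_vars)].append(row[source])
    return {key: mean(values) for key, values in groups.items()}


def aggregate_to_new_file(rows, break_vars, source, target):
    """One output case per break group (a new, aggregated file)."""
    means = group_means(rows, break_vars, source)
    return [dict(zip(break_vars, key), **{target: value}) for key, value in means.items()]


def add_aggregate_variable(rows, break_vars, source, target):
    """Same number of cases; each case receives its group's aggregate value."""
    means = group_means(rows, break_vars, source)
    return [{**row, target: means[tuple(row[b] for b in break_vars)]} for row in rows]


new_file = aggregate_to_new_file(cases, ["gender"], "age", "age_mean")
with_means = add_aggregate_variable(cases, ["gender"], "age", "age_mean")
overall = aggregate_to_new_file(cases, [], "age", "age_mean")
```

With one break variable that has two values, new_file holds two cases while with_means still holds all three. With an empty list of break variables, the whole dataset is a single break group and overall holds one case.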
Break Variable(s). Cases are grouped together based on the values of the break variables. Each unique
combination of break variable values defines a group. When creating a new, aggregated data file, all
break variables are saved in the new file with their existing names and dictionary information. The break
variable, if specified, can be either numeric or string.
Aggregated Variables. Source variables are used with aggregate functions to create new aggregate
variables. The aggregate variable name is followed by an optional variable label, the name of the
aggregate function, and the source variable name in parentheses.
You can override the default aggregate variable names with new variable names, provide descriptive
variable labels, and change the functions used to compute the aggregated data values. You can also create
a variable that contains the number of cases in each break group.
To Aggregate a Data File
1. From the menus choose:
Data > Aggregate...
2. Optionally select break variables that define how cases are grouped to create aggregated data. If no
break variables are specified, then the entire dataset is a single break group.
3. Select one or more aggregate variables.
4. Select an aggregate function for each aggregate variable.
You can add aggregate variables to the active dataset or create a new, aggregated data file.
v Add aggregated variables to active dataset. New variables based on aggregate functions are added to the
active dataset. The data file itself is not aggregated. Each case with the same value(s) of the break
variable(s) receives the same values for the new aggregate variables.
v Create a new dataset containing only the aggregated variables. Saves aggregated data to a new dataset in
the current session. The dataset includes the break variables that define the aggregated cases and all
aggregate variables defined by aggregate functions. The active dataset is unaffected.
v Write a new data file containing only the aggregated variables. Saves aggregated data to an external data
file. The file includes the break variables that define the aggregated cases and all aggregate variables
defined by aggregate functions. The active dataset is unaffected.
Sorting Options for Large Data Files
For very large data files, it may be more efficient to aggregate presorted data.
File is already sorted on break variable(s). If the data have already been sorted by values of the break
variables, this option enables the procedure to run more quickly and use less memory. Use this option
with caution.
v Data must be sorted by values of the break variables in the same order as the break variables specified
for the Aggregate Data procedure.
v If you are adding variables to the active dataset, select this option only if the data are sorted by
ascending values of the break variables.
Sort file before aggregating. In very rare instances with large data files, you may find it necessary to sort the
data file by values of the break variables prior to aggregating. This option is not recommended unless
you encounter memory or performance problems.
Aggregate Data: Aggregate Function
This dialog box specifies the function to use to calculate aggregated data values for selected variables on
the Aggregate Variables list in the Aggregate Data dialog box. Aggregate functions include:
v Summary functions for numeric variables, including mean, median, standard deviation, and sum
v Number of cases, including unweighted, weighted, nonmissing, and missing
v Percentage, fraction, or count of values above or below a specified value
v Percentage, fraction, or count of values inside or outside of a specified range
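As a rough illustration of the percentage-style functions in the list above, the following Python sketch computes the percentage of values above a cutoff and inside a range. The helper names pct_above and pct_inside are hypothetical. Treating missing values (None here) as excluded from the denominator and treating the range bounds as inclusive are assumptions of this sketch, not statements about the product's exact rules.

```python
def pct_above(values, cutoff):
    """Percentage of nonmissing values strictly greater than `cutoff`."""
    valid = [v for v in values if v is not None]  # assumption: missing values are ignored
    return 100.0 * sum(v > cutoff for v in valid) / len(valid)


def pct_inside(values, low, high):
    """Percentage of nonmissing values between `low` and `high` (bounds assumed inclusive)."""
    valid = [v for v in values if v is not None]
    return 100.0 * sum(low <= v <= high for v in valid) / len(valid)


ages = [1, 2, 3, 4, None]
share_above_two = pct_above(ages, 2)
share_two_to_three = pct_inside(ages, 2, 3)
```

The fraction and count variants follow the same pattern: drop the factor of 100 for a fraction, or return the numerator alone for a count.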
Aggregate Data: Variable Name and Label
Aggregate Data assigns default variable names for the aggregated variables in the new data file. This
dialog box enables you to change the variable name for the selected variable on the Aggregate Variables
list and provide a descriptive variable label. See the topic “Variable names” on page 50 for more
information.
Split File splits the data file into separate groups for analysis based on the values of one or more
grouping variables. If you select multiple grouping variables, cases are grouped by each variable within
categories of the preceding variable on the Groups Based On list. For example, if you select gender as the
first grouping variable and minority as the second grouping variable, cases will be grouped by minority
classification within each gender category.
v You can specify up to eight grouping variables.
v Each eight bytes of a long string variable (string variables longer than eight bytes) counts as a variable
toward the limit of eight grouping variables.
v Cases should be sorted by values of the grouping variables and in the same order that variables are
listed in the Groups Based On list. If the data file isn't already sorted, select Sort the file by grouping
variables.
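The nesting of grouping variables, and the reason sorting matters, can be sketched in Python with itertools.groupby, which (like split-file processing as described here) forms groups from consecutive cases. The data and the helper split_groups are hypothetical; this is an illustration of the grouping logic only.

```python
from itertools import groupby

# Hypothetical cases; gender is the first grouping variable, minority the second.
cases = [
    {"gender": "f", "minority": 1, "salary": 30},
    {"gender": "m", "minority": 0, "salary": 50},
    {"gender": "f", "minority": 0, "salary": 40},
    {"gender": "m", "minority": 0, "salary": 60},
]


def split_groups(rows, grouping_vars):
    """Sort by the grouping variables in list order, then collect consecutive groups."""
    def key(row):
        return tuple(row[g] for g in grouping_vars)

    ordered = sorted(rows, key=key)  # groupby only merges adjacent rows, so sort first
    return [(k, list(group)) for k, group in groupby(ordered, key=key)]


groups = split_groups(cases, ["gender", "minority"])
group_keys = [k for k, _ in groups]
```

Here group_keys lists minority categories within each gender category. Skipping the sort would split the two cases with gender "m" and minority 0 only if they were not adjacent, which is why unsorted data should be sorted before splitting.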
Compare groups. Split-file groups are presented together for comparison purposes. For pivot tables, a
single pivot table is created and each split-file variable can be moved between table dimensions. For
charts, a separate chart is created for each split-file group and the charts are displayed together in the
Viewer.
Organize output by groups. All results from each procedure are displayed separately for each split-file
group.
To Split a Data File for Analysis
1. From the menus choose:
Data > Split File...
2. Select Compare groups or Organize output by groups.
3. Select one or more grouping variables.
Select Cases provides several methods for selecting a subgroup of cases based on criteria that include
variables and complex expressions. You can also select a random sample of cases. The criteria used to
define a subgroup can include:
v Variable values and ranges
v Date and time ranges
v Case (row) numbers
v Arithmetic expressions
v Logical expressions
All cases. Turns case filtering off and uses all cases.
If condition is satisfied. Use a conditional expression to select cases. If the result of the conditional
expression is true, the case is selected. If the result is false or missing, the case is not selected.
Random sample of cases. Selects a random sample based on an approximate percentage or an exact number
of cases.
Based on time or case range. Selects cases based on a range of case numbers or a range of dates/times.
Use filter variable. Use the selected numeric variable from the data file as the filter variable. Cases with
any value other than 0 or missing for the filter variable are selected.
This section controls the treatment of unselected cases. You can choose one of the following alternatives
for the treatment of unselected cases:
v Filter out unselected cases. Unselected cases are not included in the analysis but remain in the
dataset. You can use the unselected cases later in the session if you turn filtering off. If you select a
random sample or if you select cases based on a conditional expression, this generates a variable
named filter_$ with a value of 1 for selected cases and a value of 0 for unselected cases.
v Copy selected cases to a new dataset. Selected cases are copied to a new dataset, leaving the original
dataset unaffected. Unselected cases are not included in the new dataset and are left in their original
state in the original dataset.
v Delete unselected cases. Unselected cases are deleted from the dataset. Deleted cases can be recovered
only by exiting from the file without saving any changes and then reopening the file. The deletion of
cases is permanent if you save the changes to the data file.
Note: If you delete unselected cases and save the file, the cases cannot be recovered.
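The filtering behavior can be mimicked with a short Python sketch. It assumes a hypothetical dataset with id and age variables, represents a missing value as None, and uses a hypothetical helper flag_cases. The point is only to show that a missing result is treated like false and that filtered cases stay in the dataset.

```python
# Hypothetical cases; the third case has a missing age.
cases = [
    {"id": 1, "age": 25},
    {"id": 2, "age": 10},
    {"id": 3, "age": None},
]


def is_adult(row):
    """Conditional expression: true, false, or missing (None) when age is missing."""
    if row["age"] is None:
        return None
    return row["age"] >= 18


def flag_cases(rows, condition):
    """Add filter_$ = 1 for selected cases and 0 for unselected ones; no case is removed."""
    return [{**row, "filter_$": 1 if condition(row) else 0} for row in rows]


flagged = flag_cases(cases, is_adult)
# Analyses would use only the selected cases; the others remain available.
selected = [row for row in flagged if row["filter_$"] == 1]
```

Because flagged still contains every case, turning filtering off simply means ignoring filter_$ again. Deleting unselected cases would correspond to keeping only selected, which cannot be undone once saved.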
To Select a Subset of Cases
1. From the menus choose:
Data > Select Cases...
2. Select one of the methods for selecting cases.
3. Specify the criteria for selecting cases.
Select cases: If
This dialog box allows you to select subsets of cases using conditional expressions. A conditional
expression returns a value of true, false, or missing for each case.
v If the result of a conditional expression is true, the case is included in the selected subset.
v If the result of a conditional expression is false or missing, the case is not included in the selected
subset.
v Most conditional expressions use one or more of the six relational operators (<, >, <=, >=, =, and ~=)
on the calculator pad.
v Conditional expressions can include variable names, constants, arithmetic operators, numeric (and
other) functions, logical variables, and relational operators.
Select cases: Random sample
This dialog box allows you to select a random sample based on an approximate percentage or an exact
number of cases. Sampling is performed without replacement, so the same case cannot be selected more
than once.
Approximately. Generates a random sample of approximately the specified percentage of cases. Since this
routine makes an independent pseudo-random decision for each case, the percentage of cases selected can
only approximate the specified percentage. The more cases there are in the data file, the closer the
percentage of cases selected is to the specified percentage.
Exactly. A user-specified number of cases. You must also specify the number of cases from which to
generate the sample. This second number should be less than or equal to the total number of cases in the
data file. If the number exceeds the total number of cases in the data file, the sample will contain
proportionally fewer cases than the requested number.
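The two sampling modes can be sketched in Python. The functions sample_approximately and sample_exactly are hypothetical, and the sequential selection scheme used for the exact mode is an assumption chosen because it reproduces the behavior described above (exactly n cases when the pool size is valid, proportionally fewer when the stated pool exceeds the file). It is not claimed to be the product's algorithm.

```python
import random


def sample_approximately(rows, percent, rng):
    """Independent pseudo-random decision per case, so the share is only approximate."""
    return [row for row in rows if rng.random() < percent / 100.0]


def sample_exactly(rows, n, from_m, rng):
    """Select n cases from the first from_m cases, without replacement.

    Each case is kept with probability (still needed) / (still to be seen),
    which yields exactly n cases when from_m <= len(rows). If from_m is larger
    than the file, the loop ends early and fewer cases are returned.
    """
    chosen = []
    for seen, row in enumerate(rows[:from_m]):
        still_needed = n - len(chosen)
        if rng.random() * (from_m - seen) < still_needed:
            chosen.append(row)
    return chosen


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
rows = list(range(100))
exact = sample_exactly(rows, 10, 100, rng)
approx = sample_approximately(rows, 50, rng)
```

Here exact always holds 10 distinct cases, while the size of approx varies around 50 from run to run and settles closer to the requested percentage as the number of cases grows.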
Select cases: Range
This dialog box selects cases based on a range of case numbers or a range of dates or times.
v Case ranges are based on row number as displayed in the Data Editor.
v Date and time ranges are available only for time series data with defined date variables (Data menu,
Note: If unselected cases are filtered (rather than deleted), subsequently sorting the dataset will turn off
filtering applied by this dialog.
Weight Cases gives cases different weights (by simulated replication) for statistical analysis.
v The values of the weighting variable should indicate the number of observations represented by single
cases in your data file.
v Cases with zero, negative, or missing values for the weighting variable are excluded from analysis.
v Fractional values are valid and some procedures, such as Frequencies, Crosstabs, and Custom Tables,
will use fractional weight values. However, most procedures treat the weighting variable as a
replication weight and will simply round fractional weights to the nearest integer. Some procedures
ignore the weighting variable completely, and this limitation is noted in the procedure-specific
documentation.
Once you apply a weight variable, it remains in effect until you select another weight variable or turn off
weighting. If you save a weighted data file, weighting information is saved with the data file. You can
turn off weighting at any time, even after the file has been saved in weighted form.
Weights in Crosstabs. The Crosstabs procedure has several options for handling case weights.
Weights in scatterplots and histograms. Scatterplots and histograms have an option for turning case
weights on and off, but this does not affect cases with a zero, negative, or missing value for the weight
variable. These cases remain excluded from the chart even if you turn weighting off from within the
chart.
1. From the menus choose:
Data > Weight Cases...
2. Select Weight cases by.
3. Select a frequency variable.
The values of the frequency variable are used as case weights. For example, a case with a value of 3 for
the frequency variable will represent three cases in the weighted data file.
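Weighting by simulated replication can be illustrated with a small Python sketch. The helpers weighted_mean and replicate and the data are hypothetical. The sketch assumes integer frequencies for the replication view and drops cases whose weight is zero, negative, or missing (None), as described above.

```python
def weighted_mean(values, weights):
    """Mean in which each case counts as many times as its weight."""
    total = 0.0
    weight_sum = 0.0
    for value, weight in zip(values, weights):
        if weight is None or weight <= 0:
            continue  # zero, negative, or missing weights exclude the case
        total += value * weight
        weight_sum += weight
    return total / weight_sum


def replicate(values, weights):
    """Expand each case into `weight` identical cases (integer frequencies assumed)."""
    expanded = []
    for value, weight in zip(values, weights):
        if weight is None or weight <= 0:
            continue
        expanded.extend([value] * int(weight))
    return expanded


scores = [10, 20, 30]
freq = [3, 1, 0]  # hypothetical frequency variable
mean_weighted = weighted_mean(scores, freq)
expanded = replicate(scores, freq)
mean_replicated = sum(expanded) / len(expanded)
```

The case with a frequency of 3 appears three times in expanded, the case with a frequency of 0 disappears, and both ways of computing the mean agree.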
Use the Restructure Data Wizard to restructure your data for the procedure that you want to use. The
wizard replaces the current file with a new, restructured file. The wizard can:
v Restructure selected variables into cases
v Restructure selected cases into variables
v Transpose all data
To Restructure Data
1. From the menus choose:
Data > Restructure...
2. Select the type of restructuring that you want to do.
3. Select the data to restructure.
Optionally, you can:
v Create identification variables, which allow you to trace a value in the new file back to a value in the
original file
v Sort the data prior to restructuring
v Define options for the new file
v Paste the command syntax into a syntax window
Restructure Data Wizard: Select Type
Use the Restructure Data Wizard to restructure your data. In the first dialog box, select the type of
restructuring that you want to do.
v Restructure selected variables into cases. Choose this when you have groups of related columns in
your data and you want them to appear in groups of rows in the new data file. If you choose this, the
wizard will display the steps for Variables to Cases.
v Restructure selected cases into variables. Choose this when you have groups of related rows in your
data and you want them to appear in groups of columns in the new data file. If you choose this, the
wizard will display the steps for Cases to Variables.
IBM SPSS Statistics 23 Core System User's Guide
v Transpose all data. Choose this when you want to transpose your data. All rows will become columns
and all columns will become rows in the new data. This choice closes the Restructure Data Wizard and
opens the Transpose Data dialog box.
Deciding How to Restructure the Data
A variable contains information that you want to analyze--for example, a measurement or a score. A case
is an observation--for example, an individual. In a simple data structure, each variable is a single column
in your data and each case is a single row. So, for example, if you were measuring test scores for all
students in a class, all score values would appear in only one column, and there would be a row for each
student.
When you analyze data, you are often analyzing how a variable varies according to some condition. The
condition can be a specific experimental treatment, a demographic, a point in time, or something else. In
data analysis, conditions of interest are often referred to as factors. When you analyze factors, you have a
complex data structure. You may have information about a variable in more than one column in your data
(for example, a column for each level of a factor), or you may have information about a case in more
than one row (for example, a row for each level of a factor). The Restructure Data Wizard helps you to
restructure files with a complex data structure.
The structure of the current file and the structure that you want in the new file determine the choices that
you make in the wizard.
How are the data arranged in the current file? The current data may be arranged so that factors are
recorded in a separate variable (in groups of cases) or with the variable (in groups of variables).
v Groups of cases. Does the current file have variables and conditions recorded in separate columns?
Table 10. Data with variables and conditions in separate columns
In this example, the first two rows are a case group because they are related. They contain data for the
same factor level. In IBM SPSS Statistics data analysis, the factor is often referred to as a grouping
variable when the data are structured this way.
v Groups of columns. Does the current file have variables and conditions recorded in the same column?
Table 11. Data with variables and conditions in same column
In this example, the two columns are a variable group because they are related. They contain data for the
same variable--var_1 for factor level 1 and var_2 for factor level 2. In IBM SPSS Statistics data analysis, the
factor is often referred to as a repeated measure when the data are structured this way.
How should the data be arranged in the new file? This is usually determined by the procedure that you
want to use to analyze your data.
Chapter 9. File handling and file transformations
v Procedures that require groups of cases. Your data must be structured in case groups to do analyses
that require a grouping variable. Examples are univariate, multivariate, and variance components with
General Linear Model, Mixed Models, and OLAP Cubes and independent samples with T Test or
Nonparametric Tests. If your current data structure is variable groups and you want to do these
analyses, select Restructure selected variables into cases.
v Procedures that require groups of variables. Your data must be structured in variable groups to
analyze repeated measures. Examples are repeated measures with General Linear Model, time-dependent
covariate analysis with Cox Regression Analysis, paired samples with T Test, or related samples with
Nonparametric Tests. If your current data structure is case groups and you want to do these analyses,
select Restructure selected cases into variables.
Example of Variables to Cases
In this example, test scores are recorded in separate columns for each factor, A and B.
Table 12. Test scores recorded in separate columns for each factor
You want to do an independent-samples t test. You have a column group consisting of score_a and score_b,
but you don't have the grouping variable that the procedure requires. Select Restructure selected
variables into cases in the Restructure Data Wizard, restructure one variable group into a new variable
named score, and create an index named group. The new data file is shown in the following figure.
Table 13. New, restructured data for variables to cases
When you run the independent-samples t test, you can now use group as the grouping variable.
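The variables-to-cases step in this example can be sketched in Python. The score values and the id variable below are hypothetical, since the tables are not reproduced here, and the helper variables_to_cases is illustrative rather than a description of the wizard's implementation. Only the names score_a, score_b, score, and group come from the example.

```python
# Hypothetical wide data: one row per subject, one column per factor level.
wide = [
    {"id": 1, "score_a": 1014, "score_b": 864},
    {"id": 2, "score_a": 684, "score_b": 636},
]


def variables_to_cases(rows, variable_group, target, index_name, fixed):
    """Turn each variable in the group into its own row, numbered by an index variable."""
    long_rows = []
    for row in rows:
        for index, source in enumerate(variable_group, start=1):
            new_row = {name: row[name] for name in fixed}  # variables copied unchanged
            new_row[index_name] = index  # 1 for score_a, 2 for score_b
            new_row[target] = row[source]
            long_rows.append(new_row)
    return long_rows


long_rows = variables_to_cases(wide, ["score_a", "score_b"], "score", "group", ["id"])
```

Each original row becomes two rows, and the new group variable records which column a score came from, so it can serve as the grouping variable.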
Example of Cases to Variables
In this example, test scores are recorded twice for each subject, before and after a treatment.
Table 14. Current data for cases to variables