Warning: Attempt to read property "date" on null in /usr/local/www/websvn.planix.org/blame.php on line 247

Warning: Attempt to read property "msg" on null in /usr/local/www/websvn.planix.org/blame.php on line 247
WebSVN – planix.SVN – Blame – /os/branches/feature_tlsv12/sys/doc/sam/sam.ms – Rev 2

Subversion Repositories planix.SVN

Rev

Go to most recent revision | Details | Last modification | View Log | RSS feed

Rev Author Line No. Line
2 - 1
.HTML "The Text Editor sam
2
.Vx 17 11 November 87 1 32 "ROB PIKE" "THE TEXT EDITOR SAM"
3
.ds DY "31 May 1987
4
.ds DR "Revised 1 July 1987
5
.de CW		\" puts first arg in CW font, same as UL; maintains font
6
\%\&\\$3\f(CW\\$1\fP\&\\$2
7
..
8
.de Cs
9
.br
10
.fi
11
.ft 2
12
.ps -2
13
.vs -2
14
..
15
.de Ce
16
.br
17
.nf
18
.ft 1
19
.ps
20
.vs
21
.sp
22
..
23
.de XP
24
.ie h .html - <center><img src="\\$1.gif" /></center>
25
.el .BP \\$1.ps \\$2
26
..
27
.TL
28
The Text Editor \&\f(CWsam\fP
29
.AU
30
Rob Pike
31
rob@plan9.bell-labs.com
32
.AB
33
.LP
34
.CW Sam
35
is an interactive multi-file text editor intended for
36
bitmap displays.
37
A textual command language
38
supplements the mouse-driven, cut-and-paste interface
39
to make complex or
40
repetitive editing tasks easy to specify.
41
The language is characterized by the composition of regular expressions
42
to describe the structure of the text being modified.
43
The treatment of files as a database, with changes logged
44
as atomic transactions, guides the implementation and
45
makes a general `undo' mechanism straightforward.
46
.PP
47
.CW Sam
48
is implemented as two processes connected by a low-bandwidth stream,
49
one process handling the display and the other the editing
50
algorithms.  Therefore it can run with the display process
51
in a bitmap terminal and the editor on a local host,
52
with both processes on a bitmap-equipped host, or with
53
the display process in the terminal and the editor in a
54
remote host.
55
By suppressing the display process,
56
it can even run without a bitmap terminal.
57
.PP
58
This paper is reprinted from Software\(emPractice and Experience,
59
Vol 17, number 11, pp. 813-845, November 1987.
60
The paper has not been updated for the Plan 9 manuals.  Although
61
.CW Sam
62
has not changed much since the paper was written, the system around it certainly has.
63
Nonetheless, the description here still stands as the best introduction to the editor.
64
.AE
65
.SH
66
Introduction
67
.LP
68
.CW Sam
69
is an interactive text editor that combines cut-and-paste interactive editing with
70
an unusual command language based on the composition of regular expressions.
71
It is written as two programs: one, the `host part,' runs on a UNIX system
72
and implements the command language and provides file access; the other, the
73
`terminal part,' runs asynchronously
74
on a machine with a mouse and bitmap display
75
and supports the display and interactive editing.
76
The host part may be even run in isolation on an ordinary terminal
77
to edit text using the command
78
language, much like a traditional line editor,
79
without assistance from a mouse or display.
80
Most often,
81
the terminal part runs on a Blit\u\s-4\&1\s+4\d terminal
82
(actually on a Teletype DMD 5620, the production version of the Blit), whose
83
host connection is an ordinary 9600 bps RS232 link;
84
on the SUN computer the host and display processes run on a single machine,
85
connected by a pipe.
86
.PP
87
.CW Sam
88
edits uninterpreted
89
ASCII text.
90
It has no facilities for multiple fonts, graphics or tables,
91
unlike MacWrite,\u\s-4\&2\s+4\d Bravo,\u\s-4\&3\s+4\d Tioga\u\s-4\&4\s+4\d
92
or Lara.\u\s-4\&5\s+4\d
93
Also unlike them, it has a rich command language.
94
(Throughout this paper, the phrase
95
.I
96
command language
97
.R
98
refers to
99
textual commands; commands activated from the mouse form the
100
.I mouse
101
.I language. )
102
.CW Sam
103
developed as an editor for use by programmers, and tries to join
104
the styles of the UNIX text editor
105
.CW ed \u\s-4\&6,7\s+4\d
106
with that of interactive cut-and-paste editors by
107
providing a comfortable mouse-driven interface
108
to a program with a solid command language driven by regular expressions.
109
The command language developed more than the mouse language, and
110
acquired a notation for describing the structure of files
111
more richly than as a sequence of lines,
112
using a dataflow-like syntax for specifying changes.
113
.PP
114
The interactive style was influenced by
115
.CW jim ,\u\s-4\&1\s+4\d
116
an early cut-and-paste editor for the Blit, and by
117
.CW mux ,\u\s-4\&8\s+4\d
118
the Blit window system.
119
.CW Mux
120
merges the original Blit window system,
121
.CW mpx ,\u\s-4\&1\s+4\d
122
with cut-and-paste editing, forming something like a
123
multiplexed version of
124
.CW jim
125
that edits the output of (and input to) command sessions rather than files.
126
.PP
127
The first part of this paper describes the command language, then the mouse
128
language, and explains how they interact.
129
That is followed by a description of the implementation,
130
first of the host part, then of the terminal part.
131
A principle that influenced the design of
132
.CW sam
133
is that it should have no explicit limits, such as upper limits on
134
file size or line length.
135
A secondary consideration is that it be efficient.
136
To honor these two goals together requires a method for efficiently
137
manipulating
138
huge strings (files) without breaking them into lines,
139
perhaps while making thousands of changes
140
under control of the command language.
141
.CW Sam 's
142
method is to
143
treat the file as a transaction database, implementing changes as atomic
144
updates.  These updates may be unwound easily to `undo' changes.
145
Efficiency is achieved through a collection of caches that minimizes
146
disc traffic and data motion, both within the two parts of the program
147
and between them.
148
.PP
149
The terminal part of
150
.CW sam
151
is fairly straightforward.
152
More interesting is how the two halves of the editor stay
153
synchronized when either half may initiate a change.
154
This is achieved through a data structure that organizes the
155
communications and is maintained in parallel by both halves.
156
.PP
157
The last part of the paper chronicles the writing of
158
.CW sam
159
and discusses the lessons that were learned through its development and use.
160
.PP
161
The paper is long, but is composed largely of two papers of reasonable length:
162
a description of the user interface of
163
.CW sam
164
and a discussion of its implementation.
165
They are combined because the implementation is strongly influenced by
166
the user interface, and vice versa.
167
.SH
168
The Interface
169
.LP
170
.CW Sam
171
is a text editor for multiple files.
172
File names may be provided when it is invoked:
173
.P1
174
sam file1 file2 ...
175
.P2
176
and there are commands
177
to add new files and discard unneeded ones.
178
Files are not read until necessary
179
to complete some command.
180
Editing operations apply to an internal copy
181
made when the file is read; the UNIX file associated with the copy
182
is changed only by an explicit command.
183
To simplify the discussion, the internal copy is here called a
184
.I file ,
185
while the disc-resident original is called a
186
.I
187
disc file.
188
.R
189
.PP
190
.CW Sam
191
is usually connected to a bitmap display that presents a cut-and-paste
192
editor driven by the mouse.
193
In this mode, the command language is still available:
194
text typed in a special window, called the
195
.CW sam
196
.I window,
197
is interpreted
198
as commands to be executed in the current file.
199
Cut-and-paste editing may be used in any window \(em even in the
200
.CW sam
201
window to construct commands.
202
The other mode of operation, invoked by starting
203
.CW sam
204
with the option
205
.CW -d
206
(for `no download'),
207
does not use the mouse or bitmap display, but still permits
208
editing using the textual command language, even on an ordinary terminal,
209
interactively or from a script.
210
.PP
211
The following sections describe first the command language (under
212
.CW sam\ -d
213
and in the
214
.CW sam
215
window), and then the mouse interface.
216
These two languages are nearly independent, but connect through the
217
.I current
218
.I text,
219
described below.
220
.SH 2
221
The Command Language
222
.LP
223
A file consists of its contents, which are an array of characters
224
(that is, a string); the
225
.I name
226
of the associated disc file; the
227
.I
228
modified bit
229
.R
230
that states whether the contents match those of
231
the disc file;
232
and a substring of the contents, called the
233
.I
234
current text
235
.R
236
or
237
.I dot
238
(see Figures 1 and 2).
239
If the current text is a null string, dot falls between characters.
240
The
241
.I value
242
of dot is the location of the current text; the
243
.I contents
244
of dot are the characters it contains.
245
.CW Sam
246
imparts to the text no two-dimensional interpretation such as columns
247
or fields; text is always one-dimensional.
248
Even the idea of a `line' of text as understood by most UNIX programs
249
\(em a sequence of characters terminated by a newline character \(em
250
is only weakly supported.
251
.PP
252
The
253
.I
254
current file
255
.R
256
is the file to which editing commands refer.
257
The current text is therefore dot in the current file.
258
If a command doesn't explicitly name a particular file or piece of text,
259
the command is assumed to apply to the current text.
260
For the moment, ignore the presence of multiple files and consider
261
editing a single file.
262
.KF L
263
.XP fig1 3.5i
264
.Cs
265
Figure 1. A typical
266
.CW sam
267
screen, with the editing menu presented.
268
The
269
.CW sam
270
(command language) window is in the middle, with file windows above and below.
271
(The user interface makes it easy to create these abutting windows.)
272
The partially obscured window is a third file window.
273
The uppermost window is that to which typing and mouse operations apply,
274
as indicated by its heavy border.
275
Each window has its current text highlighted in reverse video.
276
The
277
.CW sam
278
window's current text is the null string on the last visible line,
279
indicated by a vertical bar.
280
See also Figure 2.
281
.Ce
282
.KE
283
.PP
284
Commands have one-letter names.
285
Except for non-editing commands such as writing
286
the file to disc, most commands make some change
287
to the text in dot and leave dot set to the text resulting from the change.
288
For example, the delete command,
289
.CW d ,
290
deletes the text in dot, replacing it by the null string and setting dot
291
to the result.
292
The change command,
293
.CW c ,
294
replaces dot by text delimited by an arbitrary punctuation character,
295
conventionally
296
a slash.  Thus,
297
.P1
298
c/Peter/
299
.P2
300
replaces the text in dot by the string
301
.CW Peter .
302
Similarly,
303
.P1
304
a/Peter/
305
.P2
306
(append) adds the string after dot, and
307
.P1
308
i/Peter/
309
.P2
310
(insert) inserts before dot.
311
All three leave dot set to the new text,
312
.CW Peter .
313
.PP
314
Newlines are part of the syntax of commands:
315
the newline character lexically terminates a command.
316
Within the inserted text, however, newlines are never implicit.
317
But since it is often convenient to insert multiple lines of text,
318
.CW sam
319
has a special
320
syntax for that case:
321
.P1
322
a
323
some lines of text
324
to be inserted in the file,
325
terminated by a period
326
on a line by itself
327
\&.
328
.P2
329
In the one-line syntax, a newline character may be specified by a C-like
330
escape, so
331
.P1
332
c/\en/
333
.P2
334
replaces dot by a single newline character.
335
.PP
336
.CW Sam
337
also has a substitute command,
338
.CW s :
339
.P1
340
s/\f2expression\fP/\f2replacement\fP/
341
.P2
342
substitutes the replacement text for the first match, in dot,
343
of the regular expression.
344
Thus, if dot is the string
345
.CW Peter ,
346
the command
347
.P1
348
s/t/st/
349
.P2
350
changes it to
351
.CW Pester .
352
In general,
353
.CW s
354
is unnecessary, but it was inherited from
355
.CW ed
356
and it has some convenient variations.
357
For instance, the replacement text may include the matched text,
358
specified by
359
.CW & :
360
.P1
361
s/Peter/Oh, &, &, &, &!/
362
.P2
363
.PP
364
There are also three commands that apply programs
365
to text:
366
.P1
367
< \f2UNIX program\fP
368
.P2
369
replaces dot by the output of the UNIX program.
370
Similarly, the
371
.CW >
372
command
373
runs the program with dot as its standard input, and
374
.CW |
375
does both.  For example,
376
.P1
377
| sort
378
.P2
379
replaces dot by the result of applying the standard sorting utility to it.
380
Again, newlines have no special significance for these
381
.CW sam
382
commands.
383
The text acted upon and resulting from these commands is not necessarily
384
bounded by newlines, although for connection with UNIX programs,
385
newlines may be necessary to obey conventions.
386
.PP
387
One more command:
388
.CW p
389
prints the contents of dot.
390
Table I summarizes
391
.CW sam 's
392
commands.
393
.KF
394
.TS
395
center;
396
c s
397
lfCW l.
398
Table I. \f(CWSam\fP commands
399
.sp .4
400
.ft CW
401
_
402
.ft
403
.sp .4
404
\f1Text commands\fP	
405
.sp .4
406
_
407
.sp .4
408
a/\f2text\fP/	Append text after dot
409
c/\f2text\fP/	Change text in dot
410
i/\f2text\fP/	Insert text before dot
411
d	Delete text in dot
412
s/\f2regexp\fP/\f2text\fP/	Substitute text for match of regular expression in dot
413
m \f2address\fP	Move text in dot after address
414
t \f2address\fP	Copy text in dot after address
415
.sp .4
416
_
417
.sp .4
418
\f1Display commands\fP	
419
.sp .4
420
_
421
.sp .2
422
p	Print contents of dot
423
\&=	Print value (line numbers and character numbers) of dot
424
.sp .4
425
_
426
.sp .4
427
\f1File commands\fP
428
.sp .4
429
_
430
.sp .2
431
b \f2file-list\fP	Set current file to first file in list that \f(CWsam\fP has in menu
432
B \f2file-list\fP	Same as \f(CWb\fP, but load new files
433
n	Print menu lines of all files
434
D \f2file-list\fP	Delete named files from \f(CWsam\fP
435
.sp .4
436
_
437
.sp .4
438
\f1I/O commands\fP	
439
.sp .4
440
_
441
.sp .2
442
e \f2filename\fP	Replace file with named disc file
443
r \f2filename\fP	Replace dot by contents of named disc file
444
w \f2filename\fP	Write file to named disc file
445
f \f2filename\fP	Set file name and print new menu line
446
< \f2UNIX-command\fP	Replace dot by standard output of command
447
> \f2UNIX-command\fP	Send dot to standard input of command
448
| \f2UNIX-command\fP	Replace dot by result of command applied to dot
449
! \f2UNIX-command\fP	Run the command
450
.sp .4
451
_
452
.sp .4
453
\f1Loops and conditionals\fP	
454
.sp .4
455
_
456
.sp .2
457
x/\f2regexp\fP/ \f2command\fP	For each match of regexp, set dot and run command
458
y/\f2regexp\fP/ \f2command\fP	Between adjacent matches of regexp, set dot and run command
459
X/\f2regexp\fP/ \f2command\fP	Run command in each file whose menu line matches regexp
460
Y/\f2regexp\fP/ \f2command\fP	Run command in each file whose menu line does not match
461
g/\f2regexp\fP/ \f2command\fP	If dot contains a match of regexp, run command
462
v/\f2regexp\fP/ \f2command\fP	If dot does not contain a match of regexp, run command
463
.sp .4
464
_
465
.sp .4
466
\f1Miscellany\fP	
467
.sp .4
468
_
469
.sp .2
470
k	Set address mark to value of dot
471
q	Quit
472
u \f2n\fP	Undo last \f2n\fP (default 1) changes
473
{ }	Braces group commands
474
.sp .3
475
.ft CW
476
_
477
.ft
478
.TE
479
.sp
480
.KE
481
.PP
482
The value of dot may be changed by
483
specifying an
484
.I address
485
for the command.
486
The simplest address is a line number:
487
.P1
488
3
489
.P2
490
refers to the third line of the file, so
491
.P1
492
3d
493
.P2
494
deletes the third line of the file, and implicitly renumbers
495
the lines so the old line 4 is now numbered 3.
496
(This is one of the few places where
497
.CW sam
498
deals with lines directly.)
499
Line
500
.CW 0
501
is the null string at the beginning of the file.
502
If a command consists of only an address, a
503
.CW p
504
command is assumed, so typing an unadorned
505
.CW 3
506
prints line 3 on the terminal.
507
There are a couple of other basic addresses:
508
a period addresses dot itself; and
509
a dollar sign
510
.CW $ ) (
511
addresses the null string at the end of the file.
512
.PP
513
An address is always a single substring of the file.
514
Thus, the address
515
.CW 3
516
addresses the characters
517
after the second newline of
518
the file through the third newline of the file.
519
A
520
.I
521
compound address
522
.R
523
is constructed by the comma operator
524
.P1
525
\f2address1\fP,\f2address2\fP
526
.P2
527
and addresses the substring of the file from the beginning of
528
.I address1
529
to the end of
530
.I address2 .
531
For example, the command
532
.CW 3,5p
533
prints the third through fifth lines of the file and
534
.CW .,$d
535
deletes the text from the beginning of dot to the end of the file.
536
.PP
537
These addresses are all absolute positions in the file, but
538
.CW sam
539
also has relative addresses, indicated by
540
.CW +
541
or
542
.CW - .
543
For example,
544
.P1
545
$-3
546
.P2
547
is the third line before the end of the file and
548
.P1
549
\&.+1
550
.P2
551
is the line after dot.
552
If no address appears to the left of the
553
.CW +
554
or
555
.CW - ,
556
dot is assumed;
557
if nothing appears to the right,
558
.CW 1
559
is assumed.
560
Therefore,
561
.CW .+1
562
may be abbreviated to just a plus sign.
563
.PP
564
The
565
.CW +
566
operator acts relative to the end of its first argument, while the
567
.CW -
568
operator acts relative to the beginning.  Thus
569
.CW .+1
570
addresses the first line after dot,
571
.CW .-
572
addresses the first line before dot, and
573
.CW +-
574
refers to the line containing the end of dot.  (Dot may span multiple lines, and
575
.CW +
576
selects the line after the end of dot, then
577
.CW -
578
backs up one line.)
579
.PP
580
The final type of address is a regular expression, which addresses the
581
text matched by the expression.  The expression is enclosed in slashes, as in
582
.P1
583
/\f2expression\fP/
584
.P2
585
The expressions are the same as those in the UNIX program
586
.CW egrep ,\u\s-4\&6,7\s+4\d
587
and include closures, alternations, and so on.
588
They find the
589
.I
590
leftmost longest
591
.R
592
string that matches the expression, that is,
593
the first match after the point where the search is started,
594
and if more than one match begins at the same spot, the longest such match.
595
(I assume familiarity with the syntax for regular expressions in UNIX programs.\u\s-4\&9\s+4\d)
596
For example,
597
.P1
598
/x/
599
.P2
600
matches the next
601
.CW x
602
character in the file,
603
.P1
604
/xx*/
605
.P2
606
matches the next run of one or more
607
.CW x 's,
608
and
609
.P1
610
/x|Peter/
611
.P2
612
matches the next
613
.CW x
614
or
615
.CW Peter .
616
For compatibility with other UNIX programs, the `any character' operator,
617
a period,
618
does not match a newline, so
619
.P1
620
/.*/
621
.P2
622
matches the text from dot to the end of the line, but excludes the newline
623
and so will not match across
624
the line boundary.
625
.PP
626
Regular expressions are always relative addresses.
627
The direction is forwards by default,
628
so
629
.CW /Peter/
630
is really an abbreviation for
631
.CW +/Peter/ .
632
The search can be reversed with a minus sign, so
633
.P1
634
.CW -/Peter/
635
.P2
636
finds the first
637
.CW Peter
638
before dot.
639
Regular expressions may be used with other address forms, so
640
.CW 0+/Peter/
641
finds the first
642
.CW Peter
643
in the file and
644
.CW $-/Peter/
645
finds the last.
646
Table II summarizes
647
.CW sam 's
648
addresses.
649
.KF
650
.TS
651
center;
652
c s
653
lfCW l.
654
Table II. \f(CWSam\fP addresses
655
.sp .4
656
.ft CW
657
_
658
.ft
659
.sp .4
660
\f1Simple addresses\fP	
661
.sp .4
662
_
663
.sp .2
664
#\f2n\fP	The empty string after character \f2n\fP
665
\f2n\fP	Line \f2n\fP.
666
/\f2regexp\fP/	The first following match of the regular expression
667
-/\f2regexp\fP/	The first previous match of the regular expression
668
$	The null string at the end of the file
669
\&.	Dot
670
\&'	The address mark, set by \f(CWk\fP command
671
"\f2regexp\fP"	Dot in the file whose menu line matches regexp
672
.sp .4
673
_
674
.sp .4
675
\f1Compound addresses\fP	
676
.sp .4
677
_
678
.sp .2
679
\f2a1\fP+\f2a2\fP	The address \f2a2\fP evaluated starting at right of \f2a1\fP
680
\f2a1\fP-\f2a2\fP	\f2a2\fP evaluated in the reverse direction starting at left of \f2a1\fP
681
\f2a1\fP,\f2a2\fP	From the left of \f2a1\fP to the right of \f2a2\fP (default \f(CW0,$\fP)
682
\f2a1\fP;\f2a2\fP	Like \f(CW,\fP but sets dot after evaluating \f2a1\fP
683
.sp .4
684
_
685
.sp .4
686
.T&
687
c s.
688
T{
689
The operators
690
.CW +
691
and
692
.CW -
693
are high precedence, while
694
.CW ,
695
and
696
.CW ;
697
are low precedence.
698
In both
699
.CW +
700
and
701
.CW -
702
forms,
703
.I a2
704
defaults to 1 and
705
.I a1
706
defaults to dot.
707
If both
708
.I a1
709
and
710
.I a2
711
are present,
712
.CW +
713
may be elided.
714
T}
715
.sp .5
716
.ft CW
717
_
718
.ft
719
.TE
720
.sp
721
.KE
722
.PP
723
The language discussed so far will not seem novel
724
to people who use UNIX text editors
725
such as
726
.CW ed
727
or
728
.CW vi .\u\s-4\&9\s+4\d
729
Moreover, the kinds of editing operations these commands allow, with the exception
730
of regular expressions and line numbers,
731
are clearly more conveniently handled by a mouse-based interface.
732
Indeed,
733
.CW sam 's
734
mouse language (discussed at length below) is the means by which
735
simple changes are usually made.
736
For large or repetitive changes, however, a textual language
737
outperforms a manual interface.
738
.PP
739
Imagine that, instead of deleting just one occurrence of the string
740
.CW Peter ,
741
we wanted to eliminate every
742
.CW Peter .
743
What's needed is an iterator that runs a command for each occurrence of some
744
text.
745
.CW Sam 's
746
iterator is called
747
.CW x ,
748
for extract:
749
.P1
750
x/\f2expression\fP/ \f2command\fP
751
.P2
752
finds all matches in dot of the specified expression, and for each
753
such match, sets dot to the text matched and runs the command.
754
So to delete all the
755
.CW Peters:
756
.P1
757
0,$ x/Peter/ d
758
.P2
759
(Blanks in these examples are to improve readability;
760
.CW sam
761
neither requires nor interprets them.)
762
This searches the entire file
763
.CW 0,$ ) (
764
for occurrences of the string
765
.CW Peter ,
766
and runs the
767
.CW d
768
command with dot set to each such occurrence.
769
(By contrast, the comparable
770
.CW ed
771
command would delete all
772
.I lines
773
containing
774
.CW Peter ;
775
.CW sam
776
deletes only the
777
.CW Peters .)
778
The address
779
.CW 0,$
780
is commonly used, and may be abbreviated to just a comma.
781
As another example,
782
.P1
783
, x/Peter/ p
784
.P2
785
prints a list of
786
.CW Peters,
787
one for each appearance in the file, with no intervening text (not even newlines
788
to separate the instances).
789
.PP
790
Of course, the text extracted by
791
.CW x
792
may be selected by a regular expression,
793
which complicates deciding what set of matches is chosen \(em
794
matches may overlap.  This is resolved by generating the matches
795
starting from the beginning of dot using the leftmost-longest rule,
796
and searching for each match starting from the end of the previous one.
797
Regular expressions may also match null strings, but a null match
798
adjacent to a non-null match is never selected; at least one character
799
must intervene.
800
For example,
801
.P1
802
, c/AAA/
803
x/B*/ c/-/
804
, p
805
.P2
806
produces as output
807
.P1
808
-A-A-A-
809
.P2
810
because the pattern
811
.CW B*
812
matches the null strings separating the
813
.CW A 's.
814
.PP
815
The
816
.CW x
817
command has a complement,
818
.CW y ,
819
with similar syntax, that executes the command with dot set to the text
820
.I between
821
the matches of the expression.
822
For example,
823
.P1
824
, c/AAA/
825
y/A/ c/-/
826
, p
827
.P2
828
produces the same result as the example above.
829
.PP
830
The
831
.CW x
832
and
833
.CW y
834
commands are looping constructs, and
835
.CW sam
836
has a pair of conditional commands to go with them.
837
They have similar syntax:
838
.P1
839
g/\f2expression\fP/ \f2command\fP
840
.P2
841
(guard)
842
runs the command exactly once if dot contains a match of the expression.
843
This is different from
844
.CW x ,
845
which runs the command for
846
.I each
847
match:
848
.CW x
849
loops;
850
.CW g
851
merely tests, without changing the value of dot.
852
Thus,
853
.P1
854
, x/Peter/ d
855
.P2
856
deletes all occurrences of
857
.CW Peter ,
858
but
859
.P1
860
, g/Peter/ d
861
.P2
862
deletes the whole file (reduces it to a null string) if
863
.CW Peter
864
occurs anywhere in the text.
865
The complementary conditional is
866
.CW v ,
867
which runs the command if there is
868
.I no
869
match of the expression.
870
.PP
871
These control-structure-like commands may be composed to construct more
872
involved operations.  For example, to print those lines of text that
873
contain the string
874
.CW Peter :
875
.P1
876
, x/.*\en/ g/Peter/ p
877
.P2
878
The
879
.CW x
880
breaks the file into lines, the
881
.CW g
882
selects those lines containing
883
.CW Peter ,
884
and the
885
.CW p
886
prints them.
887
This command gives an address for the
888
.CW x
889
command (the whole file), but because
890
.CW g
891
does not have an explicit address, it applies to the value of
892
dot produced by the
893
.CW x
894
command, that is, to each line.
895
All commands in
896
.CW sam
897
except for the command to write a file to disc use dot for the
898
default address.
899
.PP
900
Composition may be continued indefinitely.
901
.P1
902
, x/.*\en/ g/Peter/ v/SaltPeter/ p
903
.P2
904
prints those lines containing
905
.CW Peter
906
but
907
.I not
908
those containing
909
.CW SaltPeter .
910
.SH 2
911
Structural Regular Expressions
912
.LP
913
Unlike other UNIX text editors,
914
including the non-interactive ones such as
915
.CW sed
916
and
917
.CW awk ,\u\s-4\&7\s+4\d
918
.CW sam
919
is good for manipulating files with multi-line `records.'
920
An example is an on-line phone book composed of records,
921
separated by blank lines, of the form
922
.P1
923
Herbert Tic
924
44 Turnip Ave., Endive, NJ
925
201-5555642
926
 
927
Norbert Twinge
928
16 Potato St., Cabbagetown, NJ
929
201-5553145
930
 
931
\&...
932
.P2
933
The format may be encoded as a regular expression:
934
.P1
935
(.+\en)+
936
.P2
937
that is, a sequence of one or more non-blank lines.
938
The command to print Mr. Tic's entire record is then
939
.P1
940
, x/(.+\en)+/ g/^Herbert Tic$/ p
941
.P2
942
and that to extract just the phone number is
943
.P1
944
, x/(.+\en)+/ g/^Herbert Tic$/ x/^[0-9]*-[0-9]*\en/ p
945
.P2
946
The latter command breaks the file into records,
947
chooses Mr. Tic's record,
948
extracts the phone number from the record,
949
and finally prints the number.
950
.PP
951
A more involved problem is that of
952
renaming a particular variable, say
953
.CW n ,
954
to
955
.CW num
956
in a C program.
957
The obvious first attempt,
958
.P1
959
, x/n/ c/num/
960
.P2
961
is badly flawed: it changes not only the variable
962
.CW n
963
but any letter
964
.CW n
965
that appears.
966
We need to extract all the variables, and select those that match
967
.CW n
968
and only
969
.CW n :
970
.P1
971
, x/[A-Za-z_][A-Za-z_0-9]*/ g/n/ v/../ c/num/
972
.P2
973
The pattern
974
.CW [A-Za-z_][A-Za-z_0-9]*
975
matches C identifiers.
976
Next
977
.CW g/n/
978
selects those containing an
979
.CW n .
980
Then
981
.CW v/../
982
rejects those containing two (or more) characters, and finally
983
.CW c/num/
984
changes the remainder (identifiers
985
.CW n )
986
to
987
.CW num .
988
This version clearly works much better, but there may still be problems.
989
For example, in C character and string constants, the sequence
990
.CW \en
991
is interpreted as a newline character, and we don't want to change it to
992
.CW \enum.
993
This problem can be forestalled with a
994
.CW y
995
command:
996
.P1
997
, y/\e\en/ x/[A-Za-z_][A-Za-z_0-9]*/ g/n/ v/../ c/num/
998
.P2
999
(the second
1000
.CW \e
1001
is necessary because of lexical conventions in regular expressions),
1002
or we could even reject character constants and strings outright:
1003
.P1 0
1004
,y/'[^']*'/ y/"[^"]*"/ x/[A-Za-z_][A-Za-z_0-9]*/ g/n/ v/../ c/num/
1005
.P2
1006
The
1007
.CW y
1008
commands in this version exclude from consideration all character constants
1009
and strings.
1010
The only remaining problem is to deal with the possible occurrence of
1011
.CW \e'
1012
or
1013
.CW \e"
1014
within these sequences, but it's easy to see how to resolve this difficulty.
1015
.PP
1016
The point of these composed commands is successive refinement.
1017
A simple version of the command is tried, and if it's not good enough,
1018
it can be honed by adding a clause or two.
1019
(Mistakes can be undone; see below.
1020
Also, the mouse language makes it unnecessary to retype the command each time.)
1021
The resulting chains of commands are somewhat reminiscent of
1022
shell pipelines.\u\s-4\&7\s+4\d
1023
Unlike pipelines, though, which pass along modified
1024
.I data ,
1025
.CW sam
1026
commands pass a
1027
.I view
1028
of the data.
1029
The text at each step of the command is the same, but which pieces
1030
are selected is refined step by step until the correct piece is
1031
available to the final step of the command line, which ultimately makes the change.
1032
.PP
1033
In other UNIX programs, regular expressions are used only for selection,
1034
as in the
1035
.CW sam
1036
.CW g
1037
command, never for extraction as in the
1038
.CW x
1039
or
1040
.CW y
1041
command.
1042
For example, patterns in
1043
.CW awk \u\s-4\&7\s+4\d
1044
are used to select lines to be operated on, but cannot be used
1045
to describe the format of the input text, or to handle newline-free text.
1046
The use of regular expressions to describe the structure of a piece
1047
of text rather than its contents, as in the
1048
.CW x
1049
command, 
1050
has been given a name:
1051
.I
1052
structural regular expressions.
1053
.R
1054
When they are composed, as in the above example,
1055
they are pleasantly expressive.
1056
Their use is discussed at greater length elsewhere.\u\s-4\&10\s+4\d
1057
.PP
1058
.SH 2
1059
Multiple files
1060
.LP
1061
.CW Sam
1062
has a few other commands, mostly relating to input and output.
1063
.P1
1064
e discfilename
1065
.P2
1066
replaces the contents and name of the current file with those of the named
1067
disc file;
1068
.P1
1069
w discfilename
1070
.P2
1071
writes the contents to the named disc file; and
1072
.P1
1073
r discfilename
1074
.P2
1075
replaces dot with the contents of the named disc file.
1076
All these commands use the current file's name if none is specified.
1077
Finally,
1078
.P1
1079
f discfilename
1080
.P2
1081
changes the name associated with the file and displays the result:
1082
.P1
1083
\&'-. discfilename
1084
.P2
1085
This output is called the file's
1086
.I
1087
menu line,
1088
.R
1089
because it is the contents of the file's line in the button 3 menu (described
1090
in the
1091
next section).
1092
The first three characters are a concise notation for the state of the file.
1093
The apostrophe signifies that the file is modified.
1094
The minus sign indicates the number of windows
1095
open on the file (see the next section):
1096
.CW -
1097
means none,
1098
.CW +
1099
means one, and
1100
.CW *
1101
means more than one.
1102
Finally, the period indicates that this is the current file.
1103
These characters are useful for controlling the
1104
.CW X
1105
command, described shortly.
1106
.PP
1107
.CW Sam
1108
may be started with a set of disc files (such as all the source for
1109
a program) by invoking it with a list of file names as arguments, and
1110
more may be added or deleted on demand.
1111
.P1
1112
B discfile1 discfile2 ...
1113
.P2
1114
adds the named files to
1115
.CW sam 's
1116
list, and
1117
.P1
1118
D discfile1 discfile2 ...
1119
.P2
1120
removes them from
1121
.CW sam 's
1122
memory (without effect on associated disc files).
1123
Both these commands have a syntax for using the shell\u\s-4\&7\s+4\d
1124
(the UNIX command interpreter) to generate the lists:
1125
.P1
1126
B <echo *.c
1127
.P2
1128
will add all C source files, and
1129
.P1
1130
B <grep -l variable *.c
1131
.P2
1132
will add all C source files referencing a particular variable
1133
(the UNIX command
1134
.CW grep\ -l
1135
lists all files in its arguments that contain matches of
1136
the specified regular expression).
1137
Finally,
1138
.CW D
1139
without arguments deletes the current file.
1140
.PP
1141
There are two ways to change which file is current:
1142
.P1
1143
b filename
1144
.P2
1145
makes the named file current.
1146
The
1147
.CW B
1148
command
1149
does the same, but also adds any new files to
1150
.CW sam 's
1151
list.
1152
(In practice, of course, the current file
1153
is usually chosen by mouse actions, not by textual commands.)
1154
The other way is to use a form of address that refers to files:
1155
.P1
1156
"\f2expression\fP" \f2address\fP
1157
.P2
1158
refers to the address evaluated in the file whose menu line
1159
matches the expression (there must be exactly one match).
1160
For example,
1161
.P1
1162
"peter.c" 3
1163
.P2
1164
refers to the third line of the file whose name matches
1165
.CW peter.c .
1166
This is most useful in the move
1167
.CW m ) (
1168
and copy
1169
.CW t ) (
1170
commands:
1171
.P1
1172
0,$ t "peter.c" 0
1173
.P2
1174
makes a copy of the current file at the beginning of
1175
.CW peter.c .
1176
.PP
1177
The
1178
.CW X
1179
command
1180
is a looping construct, like
1181
.CW x ,
1182
that refers to files instead of strings:
1183
.P1
1184
X/\f2expression\fP/ \f2command\fP
1185
.P2
1186
runs the command in all
1187
files whose menu lines match the expression.  The best example is
1188
.P1
1189
X/'/ w
1190
.P2
1191
which writes to disc all modified files.
1192
.CW Y
1193
is the complement of
1194
.CW X :
1195
it runs the command on all files whose menu lines don't match the expression:
1196
.P1
1197
Y/\e.c/ D
1198
.P2
1199
deletes all files that don't have
1200
.CW \&.c
1201
in their names, that is, it keeps all C source files and deletes the rest.
1202
.PP
1203
Braces allow commands to be grouped, so
1204
.P1
1205
{
1206
	\f2command1\fP
1207
	\f2command2\fP
1208
}
1209
.P2
1210
is syntactically a single command that runs two commands.
1211
Thus,
1212
.P1
1213
X/\e.c/ ,g/variable/ {
1214
	f
1215
	, x/.*\en/ g/variable/ p
1216
}
1217
.P2
1218
finds all occurrences of
1219
.CW variable
1220
in C source files, and prints
1221
out the file names and lines of each match.
1222
The precise semantics of compound operations is discussed in the implementation
1223
sections below.
1224
.PP
1225
Finally,
1226
the undo command,
1227
.CW u ,
1228
undoes the last command,
1229
no matter how many files were affected.
1230
Multiple undo operations move further back in time, so
1231
.P1
1232
u
1233
u
1234
.P2
1235
(which may be abbreviated
1236
.CW u2 )
1237
undoes the last two commands.  An undo may not be undone, however, nor
1238
may any command that adds or deletes files.
1239
Everything else is undoable, though, including for example
1240
.CW e
1241
commands:
1242
.P1
1243
e filename
1244
u
1245
.P2
1246
restores the state of the file completely, including its name, dot,
1247
and modified bit.  Because of the undo, potentially dangerous commands
1248
are not guarded by confirmations.  Only
1249
.CW D ,
1250
which destroys the information necessary to restore itself, is protected.
1251
It will not delete a modified file, but a second
1252
.CW D
1253
of the same file will succeed regardless.
1254
The
1255
.CW q
1256
command, which exits
1257
.CW sam ,
1258
is similarly guarded.
1259
.SH 2
1260
Mouse Interface
1261
.LP
1262
.CW Sam
1263
is most commonly run
1264
connected to a bitmap display and mouse for interactive editing.
1265
The only difference in the command language
1266
between regular, mouse-driven
1267
.CW sam
1268
and
1269
.CW sam\ -d
1270
is that if an address
1271
is provided without a command,
1272
.CW sam\ -d
1273
will print the text referenced by the address, but
1274
regular
1275
.CW sam
1276
will highlight it on the screen \(em in fact,
1277
dot is always highlighted (see Figure 2).
1278
.WS 1
1279
.KF
1280
.XP fig3 2.04i
1281
.Cs
1282
Figure 2. A
1283
.CW sam
1284
window.  The scroll bar down the left
1285
represents the file, with the bubble showing the fraction
1286
visible in the window.
1287
The scroll bar may be manipulated by the mouse for convenient browsing.
1288
The current text,
1289
which is highlighted, need not fit on a line.  Here it consists of one partial
1290
line, one complete line, and final partial line.
1291
.Ce
1292
.KE
1293
.PP
1294
Each file may have zero or more windows open on the display.
1295
At any time, only one window in all of
1296
.CW sam
1297
is the
1298
.I
1299
current window,
1300
.R
1301
that is, the window to which typing and mouse actions refer;
1302
this may be the
1303
.CW sam
1304
window (that in which commands may be typed)
1305
or one of the file windows.
1306
When a file has multiple windows, the image of the file in each window
1307
is always kept up to date.
1308
The current file is the last file affected by a command,
1309
so if the
1310
.CW sam
1311
window is current,
1312
the current window is not a window on the current file.
1313
However, each window on a file has its own value of dot,
1314
and when switching between windows on a single file,
1315
the file's value of dot is changed to that of the window.
1316
Thus, flipping between windows behaves in the obvious, convenient way.
1317
.PP
1318
The mouse on the Blit has three buttons, numbered left to right.
1319
Button 3 has a list of commands to manipulate windows,
1320
followed by a list of `menu lines' exactly as printed by the
1321
.CW f
1322
command, one per file (not one per window).
1323
These menu lines are sorted by file name.
1324
If the list is long, the Blit menu software will make it more manageable
1325
by generating a scrolling menu instead of an unwieldy long list.
1326
Using the menu to select a file from the list makes that file the current
1327
file, and the most recently current window in that file the current window.
1328
But if that file is already current, selecting it in the menu cycles through
1329
the windows on the file; this simple trick avoids a special menu to
1330
choose windows on a file.
1331
If there is no window open on the file,
1332
.CW sam
1333
changes the mouse cursor to prompt the user to create one.
1334
.PP
1335
The commands on the button 3 menu are straightforward (see Figure 3), and
1336
are like the commands to manipulate windows in
1337
.CW mux ,\u\s-4\&8\s+4\d
1338
the Blit's window system.
1339
.CW New
1340
makes a new file, and gives it one empty window, whose size is determined
1341
by a rectangle swept by the mouse.
1342
.CW Zerox
1343
prompts for a window to be selected, and
1344
makes a clone of that window; this is how multiple windows are created on one file.
1345
.CW Reshape
1346
changes the size of the indicated window, and
1347
.CW close
1348
deletes it.  If that is the last window open on the file,
1349
.CW close
1350
first does a
1351
.CW D
1352
command on the file.
1353
.CW Write
1354
is identical to a
1355
.CW w
1356
command on the file; it is in the menu purely for convenience.
1357
Finally,
1358
.CW ~~sam~~
1359
is a menu item that appears between the commands and the file names.
1360
Selecting it makes the
1361
.CW sam
1362
window the current window,
1363
causing subsequent typing to be interpreted as commands.
1364
.KF
1365
.XP fig2 2.74i
1366
.Cs
1367
Figure 3. The menu on button 3.
1368
The black rectangle on the left is a scroll bar; the menu is limited to
1369
the length shown to prevent its becoming unwieldy.
1370
Above the
1371
.CW ~~sam~~
1372
line is a list of commands;
1373
beneath it is a list of files, presented exactly as with the
1374
.CW f
1375
command.
1376
.Ce
1377
.KE
1378
.PP
1379
When
1380
.CW sam
1381
requests that a window be swept, in response to
1382
.CW new ,
1383
.CW zerox
1384
or
1385
.CW reshape ,
1386
it changes the mouse cursor from the usual arrow to a box with
1387
a small arrow.
1388
In this state, the mouse may be used to indicate an arbitrary rectangle by
1389
pressing button 3 at one corner and releasing it at the opposite corner.
1390
More conveniently,
1391
button 3 may simply be clicked,
1392
whereupon
1393
.CW sam
1394
creates the maximal rectangle that contains the cursor
1395
and abuts the
1396
.CW sam
1397
window.
1398
By placing the
1399
.CW sam
1400
window in the middle of the screen, the user can define two regions (one above,
1401
one below) in which stacked fully-overlapping
1402
windows can be created with minimal fuss (see Figure 1).
1403
This simple user interface trick makes window creation noticeably easier.
1404
.PP
1405
The cut-and-paste editor is essentially the same as that in Smalltalk-80.\u\s-4\&11\s+4\d
1406
The text in dot is always highlighted on the screen.
1407
When a character is typed it replaces dot, and sets dot to the null
1408
string after the character.  Thus, ordinary typing inserts text.
1409
Button 1 is used for selection:
1410
pressing the button, moving the mouse, and lifting the button
1411
selects (sets dot to) the text between the points where the
1412
button was pressed and released.
1413
Pressing and releasing at the same point selects a null string; this
1414
is called clicking.  Clicking twice quickly, or
1415
.I
1416
double clicking,
1417
.R
1418
selects larger objects;
1419
for example, double clicking in a word selects the word,
1420
double clicking just inside an opening bracket selects the text
1421
contained in the brackets (handling nested brackets correctly),
1422
and similarly for
1423
parentheses, quotes, and so on.
1424
The double-clicking rules reflect a bias toward
1425
programmers.
1426
If
1427
.CW sam
1428
were intended more for word processing, double-clicks would probably
1429
select linguistic structures such as sentences.
1430
.PP
1431
If button 1 is pressed outside the current window, it makes the indicated
1432
window current.
1433
This is the easiest way to switch between windows and files.
1434
.PP
1435
Pressing button 2 brings up a menu of editing functions (see Figure 4).
1436
These mostly apply to the selected text:
1437
.CW cut
1438
deletes the selected text, and remembers it in a hidden buffer called the
1439
.I
1440
snarf buffer,
1441
.R
1442
.CW paste
1443
replaces the selected text by the contents of the snarf buffer,
1444
.CW snarf
1445
just copies the selected text to the snarf buffer,
1446
.CW look
1447
searches forward for the next literal occurrence of the selected text, and
1448
.CW <mux>
1449
exchanges snarf buffers with the window system in which
1450
.CW sam
1451
is running.
1452
Finally, the last regular expression used appears as a menu entry
1453
to search
1454
forward for the next occurrence of a match for the expression.
1455
.WS 1
1456
.KF
1457
.XP fig4 1.20i
1458
.Cs
1459
Figure 4. The menu on button 2.
1460
The bottom entry tracks the most recently used regular expression, which may
1461
be literal text.
1462
.Ce
1463
.KE
1464
.PP
1465
The relationship between the command language and the mouse language is
1466
entirely due to the equality of dot and the selected text chosen
1467
with button 1 on the mouse.
1468
For example, to make a set of changes in a C subroutine, dot can be
1469
set by double clicking on the left brace that begins the subroutine,
1470
which sets dot for the command language.
1471
An address-free command then typed in the
1472
.CW sam
1473
window will apply only to the text between the opening and closing
1474
braces of the function.
1475
The idea is to select what you want, and then say what you want
1476
to do with it, whether invoked by a menu selection or by a typed command.
1477
And of course, the value of dot is highlighted on
1478
the display after the command completes.
1479
This relationship between mouse interface and command language
1480
is clumsy to explain, but comfortable, even natural, in practice.
1481
.SH
1482
The Implementation
1483
.LP
1484
The next few sections describe how
1485
.CW sam
1486
is put together, first the host part,
1487
then the inter-component communication,
1488
then the terminal part.
1489
After explaining how the command language is implemented,
1490
the discussion follows (roughly) the path of a character
1491
from the temporary file on disc to the screen.
1492
The presentation centers on the data structures,
1493
because that is how the program was designed and because
1494
the algorithms are easy to provide, given the right data
1495
structures.
1496
.SH 2
1497
Parsing and execution
1498
.LP
1499
The command language is interpreted by parsing each command with a
1500
table-driven recursive
1501
descent parser, and when a complete command is assembled, invoking a top-down
1502
executor.
1503
Most editors instead employ a simple character-at-a-time
1504
lexical scanner.
1505
Use of a parser makes it
1506
easy and unambiguous to detect when a command is complete,
1507
which has two advantages.
1508
First, escape conventions such as backslashes to quote
1509
multiple-line commands are unnecessary;  if the command isn't finished,
1510
the parser keeps reading.  For example, a multiple-line append driven by an
1511
.CW x
1512
command is straightforward:
1513
.P1
1514
x/.*\en/ g/Peter/ a
1515
one line about Peter
1516
another line about Peter
1517
\&.
1518
.P2
1519
Other UNIX editors would require a backslash after all but the last line.
1520
.PP
1521
The other advantage is specific to the two-process structure of
1522
.CW sam .
1523
The host process must decide when a command is completed so the
1524
command interpreter can be called.  This problem is easily resolved
1525
by having the lexical analyzer read the single stream of events from the
1526
terminal, directly executing all typing and mouse commands,
1527
but passing to the parser characters typed to the
1528
.CW sam
1529
command window.
1530
This scheme is slightly complicated by the availability of cut-and-paste
1531
editing in the
1532
.CW sam
1533
window, but that difficulty is resolved by applying the rules
1534
used in
1535
.CW mux :
1536
when a newline is typed to the
1537
.CW sam
1538
window, all text between the newline and the previously typed newline
1539
is made available to the parser.
1540
This permits arbitrary editing to be done to a command before
1541
typing newline and thereby requesting execution.
1542
.PP
1543
The parser is driven by a table because the syntax of addresses
1544
and commands is regular enough
1545
to be encoded compactly.  There are few special cases, such as the
1546
replacement text in a substitution, so the syntax of almost all commands
1547
can be encoded with a few flags.
1548
These include whether the command allows an address (for example,
1549
.CW e
1550
does not), whether it takes a regular expression (as in
1551
.CW x
1552
and
1553
.CW s ),
1554
whether it takes replacement text (as in
1555
.CW c
1556
or
1557
.CW i ),
1558
which may be multi-line, and so on.
1559
The internal syntax of regular expressions is handled by a separate
1560
parser; a regular expression is a leaf of the command parse tree.
1561
Regular expressions are discussed fully in the next section.
1562
.PP
1563
The parser table also has information about defaults, so the interpreter
1564
is always called with a complete tree.  For example, the parser fills in
1565
the implicit
1566
.CW 0
1567
and
1568
.CW $
1569
in the abbreviated address
1570
.CW ,
1571
(comma),
1572
inserts a
1573
.CW +
1574
to the left of an unadorned regular expression in an address,
1575
and provides the usual default address
1576
.CW .
1577
(dot) for commands that expect an address but are not given one.
1578
.PP
1579
Once a complete command is parsed, the evaluation is easy.
1580
The address is evaluated left-to-right starting from the value of dot,
1581
with a mostly ordinary expression evaluator.
1582
Addresses, like many of the data structures in
1583
.CW sam ,
1584
are held in a C structure and passed around by value:
1585
.P1
1586
typedef long Posn;    /* Position in a file */
1587
typedef struct Range{
1588
        Posn    p1, p2;
1589
}Range;
1590
typedef struct Address{
1591
        Range   r;
1592
        File    *f;
1593
}Address;
1594
.P2
1595
An address is encoded as a substring (character positions
1596
.CW p1
1597
to
1598
.CW p2 )
1599
in a file
1600
.CW f .
1601
(The data type
1602
.CW File
1603
is described in detail below.)
1604
.PP
1605
The address interpreter is an
1606
.CW Address -valued
1607
function that traverses the parse tree describing an address (the
1608
parse tree for the address has type
1609
.CW Addrtree ):
1610
.P1
1611
Address
1612
address(ap, a, sign)
1613
	Addrtree *ap;
1614
	Address a;
1615
	int sign;
1616
{
1617
	Address a2;
1618
	do
1619
		switch(ap->type){
1620
		case '.':
1621
			a=a.f->dot;
1622
			break;
1623
		case '$':
1624
			a.r.p1=a.r.p2=a.f->nbytes;
1625
			break;
1626
		case '"':	
1627
			a=matchfile(a, ap->aregexp)->dot; 
1628
			break;
1629
		case ',':
1630
			a2=address(ap->right, a, 0);
1631
			a=address(ap->left, a, 0);
1632
			if(a.f!=a2.f || a2.r.p2<a.r.p1)
1633
				error(Eorder);
1634
			a.r.p2=a2.r.p2;
1635
			return a;
1636
		/* and so on */
1637
		}
1638
	while((ap=ap->right)!=0);
1639
	return a;
1640
}
1641
.P2
1642
.PP
1643
Throughout, errors are handled by a non-local
1644
.CW goto
1645
(a
1646
.CW setjmp/longjmp
1647
in C terminology)
1648
hidden in a routine called
1649
.CW error
1650
that immediately aborts the execution, retracts any
1651
partially made changes (see the section below on `undoing'), and
1652
returns to the top level of the parser.
1653
The argument to
1654
.CW error
1655
is an enumeration type that
1656
is translated to a terse but possibly helpful
1657
message such as `?addresses out of order.'
1658
Very common messages are kept short; for example the message for
1659
a failed regular expression search is `?search.'
1660
.PP
1661
Character addresses such as
1662
.CW #3
1663
are trivial to implement, as the
1664
.CW File
1665
data structure is accessible by character number.
1666
However,
1667
.CW sam
1668
keeps no information about the position of newlines \(em it is too
1669
expensive to track dynamically \(em so line addresses are computed by reading
1670
the file, counting newlines.  Except in very large files, this has proven
1671
acceptable: file access is fast enough to make the technique practical,
1672
and lines are not central to the structure of the command language.
1673
.PP
1674
The command interpreter, called
1675
.CW cmdexec ,
1676
is also straightforward.  The parse table includes a
1677
function to call to interpret a particular command.  That function
1678
receives as arguments
1679
the calculated address
1680
for the command
1681
and the command tree (of type
1682
.CW Cmdtree ),
1683
which may contain information such as the subtree for compound commands.
1684
Here, for example, is the function for the
1685
.CW g
1686
and
1687
.CW v
1688
commands:
1689
.P1
1690
int
1691
g_cmd(a, cp)
1692
	Address a;
1693
	Cmdtree *cp;
1694
{
1695
	compile(cp->regexp);
1696
	if(execute(a.f, a.r.p1, a.r.p2)!=(cp->cmdchar=='v')){
1697
		a.f->dot=a;
1698
		return cmdexec(a, cp->subcmd);
1699
	}
1700
	return TRUE;	/* cause execution to continue */
1701
}
1702
.P2
1703
.CW Compile "" (
1704
and
1705
.CW execute
1706
are part of the regular expression code, described in the next section.)
1707
Because the parser and the
1708
.CW File
1709
data structure do most of the work, most commands
1710
are similarly brief.
1711
.SH 2
1712
Regular expressions
1713
.LP
1714
The regular expression code in
1715
.CW sam
1716
is an interpreted, rather than compiled on-the-fly, implementation of Thompson's
1717
non-deterministic finite automaton algorithm.\u\s-4\&12\s+4\d
1718
The syntax and semantics of the expressions are as in the UNIX program
1719
.CW egrep ,
1720
including alternation, closures, character classes, and so on.
1721
The only changes in the notation are two additions:
1722
.CW \en
1723
is translated to, and matches, a newline character, and
1724
.CW @
1725
matches any character.  In
1726
.CW egrep ,
1727
the character
1728
.CW \&.
1729
matches any character except newline, and in
1730
.CW sam
1731
the same rule seemed safest, to prevent idioms like
1732
.CW \&.*
1733
from spanning newlines.
1734
.CW Egrep
1735
expressions are arguably too complicated for an interactive editor \(em
1736
certainly it would make sense if all the special characters were two-character
1737
sequences, so that most of the punctuation characters wouldn't have
1738
peculiar meanings \(em but for an interesting command language, full
1739
regular expressions are necessary, and
1740
.CW egrep
1741
defines the full regular expression syntax for UNIX programs.
1742
Also, it seemed superfluous to define a new syntax, since various UNIX programs
1743
.CW ed , (
1744
.CW egrep
1745
and
1746
.CW vi )
1747
define too many already.
1748
.PP
1749
The expressions are compiled by a routine,
1750
.CW compile ,
1751
that generates the description of the non-deterministic finite state machine.
1752
A second routine,
1753
.CW execute ,
1754
interprets the machine to generate the leftmost-longest match of the
1755
expression in a substring of the file.
1756
The algorithm is described elsewhere.\u\s-4\&12,13\s+4\d
1757
.CW Execute
1758
reports
1759
whether a match was found, and sets a global variable,
1760
of type
1761
.CW Range ,
1762
to the substring matched.
1763
.PP
1764
A trick is required to evaluate the expression in reverse, such as when
1765
searching backwards for an expression.
1766
For example,
1767
.P1
1768
-/P.*r/
1769
.P2
1770
looks backwards through the file for a match of the expression.
1771
The expression, however, is defined for a forward search.
1772
The solution is to construct a machine identical to the machine
1773
for a forward search except for a reversal of all the concatenation
1774
operators (the other operators are symmetric under direction reversal),
1775
to exchange the meaning of the operators
1776
.CW ^
1777
and
1778
.CW $ ,
1779
and then to read the file backwards, looking for the
1780
usual earliest longest match.
1781
.PP
1782
.CW Execute
1783
generates only one match each time it is called.
1784
To interpret looping constructs such as the
1785
.CW x
1786
command,
1787
.CW sam
1788
must therefore synchronize between
1789
calls of
1790
.CW execute
1791
to avoid
1792
problems with null matches.
1793
For example, even given the leftmost-longest rule,
1794
the expression
1795
.CW a*
1796
matches three times in the string
1797
.CW ab
1798
(the character
1799
.CW a ,
1800
the null string between the
1801
.CW a
1802
and
1803
.CW b ,
1804
and the final null string).
1805
After returning a match for the
1806
.CW a ,
1807
.CW sam
1808
must not match the null string before the
1809
.CW b .
1810
The algorithm starts
1811
.CW execute
1812
at the end of its previous match, and
1813
if the match it returns
1814
is null and abuts the previous match, rejects the match and advances
1815
the initial position one character.
1816
.SH 2
1817
Memory allocation
1818
.LP
1819
The C language has no memory allocation primitives, although a standard
1820
library routine,
1821
.CW malloc ,
1822
provides adequate service for simple programs.
1823
For specific uses, however,
1824
it can be better to write a custom allocator.
1825
The allocator (or rather, pair of allocators) described here
1826
work in both the terminal and host parts of
1827
.CW sam .
1828
They are designed for efficient manipulation of strings,
1829
which are allocated and freed frequently and vary in length from essentially
1830
zero to 32 Kbytes (very large strings are written to disc).
1831
More important, strings may be large and change size often,
1832
so to minimize memory usage it is helpful to reclaim and to coalesce the
1833
unused portions of strings when they are truncated.
1834
.PP
1835
Objects to be allocated in
1836
.CW sam
1837
are of two flavors:
1838
the first is C
1839
.CW structs ,
1840
which are small and often addressed by pointer variables;
1841
the second is variable-sized arrays of characters
1842
or integers whose
1843
base pointer is always used to access them.
1844
The memory allocator in
1845
.CW sam
1846
is therefore in two parts:
1847
first, a traditional first-fit allocator that provides fixed storage for
1848
.CW structs ;
1849
and second, a garbage-compacting allocator that reduces storage
1850
overhead for variable-sized objects, at the cost of some bookkeeping.
1851
The two types of objects are allocated from adjoining arenas, with
1852
the garbage-compacting allocator controlling the arena with higher addresses.
1853
Separating into two arenas simplifies compaction and prevents fragmentation due
1854
to immovable objects.
1855
The access rules for garbage-compactable objects
1856
(discussed in the next paragraph) allow them to be relocated, so when
1857
the first-fit arena needs space, it moves the garbage-compacted arena
1858
to higher addresses to make room.  Storage is therefore created only
1859
at successively higher addresses, either when more garbage-compacted
1860
space is needed or when the first-fit arena pushes up the other arena.
1861
.PP
1862
Objects that may be compacted declare to the
1863
allocator a cell that is guaranteed to be the sole repository of the
1864
address of the object whenever a compaction can occur.
1865
The compactor can then update the address when the object is moved.
1866
For example, the implementation of type
1867
.CW List
1868
(really a variable-length array)
1869
is:
1870
.P1
1871
typedef struct List{
1872
        int     nused;
1873
        long    *ptr;
1874
}List;
1875
.P2
1876
The
1877
.CW ptr
1878
cell must always be used directly, and never copied.  When a
1879
.CW List
1880
is to be created the
1881
.CW List
1882
structure is allocated in the ordinary first-fit arena
1883
and its
1884
.CW ptr
1885
is allocated in the garbage-compacted arena.
1886
A similar data type for strings, called
1887
.CW String ,
1888
stores variable-length character arrays of up to 32767 elements.
1889
.PP
1890
A related matter of programming style:
1891
.CW sam
1892
frequently passes structures by value, which
1893
simplifies the code.
1894
Traditionally, C programs have
1895
passed structures by reference, but implicit allocation on
1896
the stack is easier to use.
1897
Structure passing is a relatively new feature of C
1898
(it is not in the 
1899
standard reference manual for C\u\s-4\&14\s+4\d), and is poorly supported in most
1900
commercial C compilers.
1901
It's convenient and expressive, though,
1902
and simplifies memory management by
1903
avoiding the allocator altogether
1904
and eliminating pointer aliases.
1905
.SH 2
1906
Data structures for manipulating files
1907
.LP
1908
Experience with
1909
.CW jim
1910
showed that the requirements
1911
of the file data structure were few, but strict.
1912
First, files need to be read and written quickly;
1913
adding a fresh file must be painless.
1914
Second, the implementation must place no arbitrary upper limit on
1915
the number or sizes of files.  (It should be practical to edit many files,
1916
and files up to megabytes in length should be handled gracefully.)
1917
This implies that files be stored on disc, not in main memory.
1918
(Aficionados of virtual memory may argue otherwise, but the
1919
implementation of virtual
1920
memory in our system is not something to depend on
1921
for good performance.)
1922
Third, changes to files need be made by only two primitives:
1923
deletion and insertion.
1924
These are inverses of each other,
1925
which simplifies the implementation of the undo operation.
1926
Finally,
1927
it must be easy and efficient to access the file, either
1928
forwards or backwards, a byte at a time.
1929
.PP
1930
The
1931
.CW File
1932
data type is constructed from three simpler data structures that hold arrays
1933
of characters.
1934
Each of these types has an insertion and deletion operator, and the
1935
insertion and deletion operators of the
1936
.CW File
1937
type itself are constructed from them.
1938
.PP
1939
The simplest type is the
1940
.CW String ,
1941
which is used to hold strings in main memory.
1942
The code that manages
1943
.CW Strings
1944
guarantees that they will never be longer
1945
than some moderate size, and in practice they are rarely larger than 8 Kbytes.
1946
.CW Strings
1947
have two purposes: they hold short strings like file names with little overhead,
1948
and because they are deliberately small, they are efficient to modify.
1949
They are therefore used as the data structure for in-memory caches.
1950
.PP
1951
The disc copy of the file is managed by a data structure called a
1952
.CW Disc ,
1953
which corresponds to a temporary file.  A
1954
.CW Disc
1955
has no storage in main memory other than bookkeeping information;
1956
the actual data being held is all on the disc.
1957
To reduce the number of open files needed,
1958
.CW sam
1959
opens a dozen temporary UNIX files and multiplexes the
1960
.CW Discs
1961
upon them.
1962
This permits many files to
1963
be edited; the entire
1964
.CW sam
1965
source (48 files) may be edited comfortably with a single
1966
instance of
1967
.CW sam .
1968
Allocating one temporary file per
1969
.CW Disc
1970
would strain the operating system's limit on the number of open files.
1971
Also, spreading the traffic among temporary files keeps the files shorter,
1972
and shorter files are more efficiently implemented by the UNIX
1973
I/O subsystem.
1974
.PP
1975
A
1976
.CW Disc
1977
is an array of fixed-length blocks, each of which contains
1978
between 1 and 4096 characters of active data.
1979
(The block size of our UNIX file system is 4096 bytes.)
1980
The block addresses within the temporary file and the length of each
1981
block are stored in a
1982
.CW List .
1983
When changes are made the live part of blocks may change size.
1984
Blocks are created and coalesced when necessary to try to keep the sizes
1985
between 2048 and 4096 bytes.
1986
An actively changing part of the
1987
.CW Disc
1988
therefore typically has about a kilobyte of slop that can be
1989
inserted or deleted
1990
without changing more than one block or affecting the block order.
1991
When an insertion would overflow a block, the block is split, a new one
1992
is allocated to receive the overflow, and the memory-resident list of blocks
1993
is rearranged to reflect the insertion of the new block.
1994
.PP
1995
Obviously, going to the disc for every modification to the file is
1996
prohibitively expensive.
1997
The data type
1998
.CW Buffer
1999
consists of a
2000
.CW Disc
2001
to hold the data and a
2002
.CW String
2003
that acts as a cache.
2004
This is the first of a series of caches throughout the data structures in
2005
.CW sam.
2006
The caches not only improve performance, they provide a way to organize
2007
the flow of data, particularly in the communication between the host
2008
and terminal.
2009
This idea is developed below, in the section on communications.
2010
.PP
2011
To reduce disc traffic, changes to a
2012
.CW Buffer
2013
are mediated by a variable-length string, in memory, that acts as a cache.
2014
When an insertion or deletion is made to a
2015
.CW Buffer ,
2016
if the change can be accommodated by the cache, it is done there.
2017
If the cache becomes bigger than a block because of an insertion,
2018
some of it is written to the
2019
.CW Disc
2020
and deleted from the cache.
2021
If the change does not intersect the cache, the cache is flushed.
2022
The cache is only loaded at the new position if the change is smaller than a block;
2023
otherwise, it is sent directly to the
2024
.CW Disc .
2025
This is because
2026
large changes are typically sequential,
2027
whereupon the next change is unlikely to overlap the current one.
2028
.PP
2029
A
2030
.CW File
2031
comprises a
2032
.CW String
2033
to hold the file name and some ancillary data such as dot and the modified bit.
2034
The most important components, though, are a pair of
2035
.CW Buffers ,
2036
one called the transcript and the other the contents.
2037
Their use is described in the next section.
2038
.PP
2039
The overall structure is shown in Figure 5.
2040
Although it may seem that the data is touched many times on its
2041
way from the
2042
.CW Disc ,
2043
it is read (by one UNIX system call) directly into the cache of the
2044
associated
2045
.CW Buffer ;
2046
no extra copy is done.
2047
Similarly, when flushing the cache, the text is written
2048
directly from the cache to disc.
2049
Most operations act directly on the text in the cache.
2050
A principle applied throughout
2051
.CW sam
2052
is that the fewer times the data is copied, the faster the program will run
2053
(see also the paper by Waite\u\s-4\&15\s+4\d).
2054
.KF
2055
.PS
2056
copy "fig5.pic"
2057
.PE
2058
.Cs
2059
Figure 5. File data structures.
2060
The temporary files are stored in the standard repository for such files
2061
on the host system.
2062
.Ce
2063
.KE
2064
.PP
2065
The contents of a
2066
.CW File
2067
are accessed by a routine that
2068
copies to a buffer a substring of a file starting at a specified offset.
2069
To read a byte at a time, a
2070
.CW File "" per-
2071
array is loaded starting from a specified initial position,
2072
and bytes may then be read from the array.
2073
The implementation is done by a macro similar to the C standard I/O
2074
.CW getc
2075
macro.\u\s-4\&14\s+4\d
2076
Because the reading may be done at any address, a minor change to the
2077
macro allows the file to be read backwards.
2078
This array is read-only; there is no
2079
.CW putc .
2080
.SH 2
2081
Doing and undoing
2082
.LP
2083
.CW Sam
2084
has an unusual method for managing changes to files.
2085
The command language makes it easy to specify multiple variable-length changes
2086
to a file millions of bytes long, and such changes
2087
must be made efficiently if the editor is to be practical.
2088
The usual techniques for inserting and deleting strings
2089
are inadequate under these conditions.
2090
The
2091
.CW Buffer
2092
and
2093
.CW Disc
2094
data structures are designed for efficient random access to long strings,
2095
but care must be taken to avoid super-linear behavior when making
2096
many changes simultaneously.
2097
.PP
2098
.CW Sam
2099
uses a two-pass algorithm for making changes, and treats each file as a database
2100
against which transactions are registered.
2101
Changes are not made directly to the contents.
2102
Instead, when a command is started, a `mark' containing
2103
a sequence number is placed in the transcript
2104
.CW Buffer ,
2105
and each change made to the file, either an insertion or deletion
2106
or a change to the file name,
2107
is appended to the end of the transcript.
2108
When the command is complete, the transcript is rewound to the
2109
mark and applied to the contents.
2110
.PP
2111
One reason for separating evaluation from
2112
application in this way is to simplify tracking the addresses of changes
2113
made in the middle of a long sequence.
2114
The two-pass algorithm also allows all changes to apply to the
2115
.I original
2116
data: no change can affect another change made in the same command.
2117
This is particularly important when evaluating an
2118
.CW x
2119
command because it prevents regular expression matches
2120
from stumbling over changes made earlier in the execution.
2121
Also, the two-pass
2122
algorithm is cleaner than the way other UNIX editors allow changes to
2123
affect each other;
2124
for example,
2125
.CW ed 's
2126
idioms to do things like delete every other line
2127
depend critically on the implementation.
2128
Instead,
2129
.CW sam 's
2130
simple model, in which all changes in a command occur effectively
2131
simultaneously, is easy to explain and to understand.
2132
.PP
2133
The records in the transcript are of the form ``delete substring from
2134
locations
2135
123 to 456'' and ``insert 11 characters `hello there' at location 789.''
2136
(It is an error if the changes are not at monotonically greater
2137
positions through the file.)
2138
While the update is occurring, these numbers must be
2139
offset by earlier changes, but that is straightforward and
2140
local to the update routine;
2141
moreover, all the numbers have been computed
2142
before the first is examined.
2143
.PP
2144
Treating the file as a transaction system has another advantage:
2145
undo is trivial.
2146
All it takes is to invert the transcript after it has been
2147
implemented, converting insertions
2148
into deletions and vice versa, and saving them in a holding
2149
.CW Buffer .
2150
The `do' transcript can then be deleted from
2151
the transcript
2152
.CW Buffer
2153
and replaced by the `undo' transcript.
2154
If an undo is requested, the transcript is rewound and the undo transcript
2155
executed.
2156
Because the transcript
2157
.CW Buffer
2158
is not truncated after each command, it accumulates
2159
successive changes.
2160
A sequence of undo commands
2161
can therefore back up the file arbitrarily,
2162
which is more helpful than the more commonly implemented self-inverse form of undo.
2163
.CW Sam "" (
2164
provides no way to undo an undo, but if it were desired,
2165
it would be easy to provide by re-interpreting the `do' transcript.)
2166
Each mark in the transcript contains a sequence number and the offset into
2167
the transcript of the previous mark, to aid in unwinding the transcript.
2168
Marks also contain the value of dot and the modified bit so these can be
2169
restored easily.
2170
Undoing multiple files is easy; it merely demands undoing all files whose
2171
latest change has the same sequence number as the current file.
2172
.PP
2173
Another benefit of having a transcript is that errors encountered in the middle
2174
of a complicated command need not leave the files in an intermediate state.
2175
By rewinding the transcript to the mark beginning the command,
2176
the partial command can be trivially undone.
2177
.PP
2178
When the update algorithm was first implemented, it was unacceptably slow,
2179
so a cache was added to coalesce nearby changes,
2180
replacing multiple small changes by a single larger one.
2181
This reduced the number
2182
of insertions into the transaction
2183
.CW Buffer ,
2184
and made a dramatic improvement in performance,
2185
but made it impossible
2186
to handle changes in non-monotonic order in the file; the caching method
2187
only works if changes don't overlap.
2188
Before the cache was added, the transaction could in principle be sorted
2189
if the changes were out of order, although
2190
this was never done.
2191
The current status is therefore acceptable performance with a minor
2192
restriction on global changes, which is sometimes, but rarely, an annoyance.
2193
.PP
2194
The update algorithm obviously paws the data more than simpler
2195
algorithms, but it is not prohibitively expensive;
2196
the caches help.
2197
(The principle of avoiding copying the data is still honored here,
2198
although not as piously:
2199
the data is moved from contents' cache to
2200
the transcript's all at once and through only one internal buffer.)
2201
Performance figures confirm the efficiency.
2202
To read from a dead start a hundred kilobyte file on a VAX-11/750
2203
takes 1.4 seconds of user time, 2.5 seconds of system time,
2204
and 5 seconds of real time.
2205
Reading the same file in
2206
.CW ed
2207
takes 6.0 seconds of user time, 1.7 seconds of system time,
2208
and 8 seconds of real time.
2209
.CW Sam
2210
uses about half the CPU time.
2211
A more interesting example is the one stated above:
2212
inserting a character between every pair of characters in the file.
2213
The
2214
.CW sam
2215
command is
2216
.P1
2217
,y/@/ a/x/
2218
.P2
2219
and takes 3 CPU seconds per kilobyte of input file, of which
2220
about a third is spent in the regular expression code.
2221
This translates to about 500 changes per second.
2222
.CW Ed
2223
takes 1.5 seconds per kilobyte to make a similar change (ignoring newlines),
2224
but cannot undo it.
2225
The same example in
2226
.CW ex ,\u\s-4\&9\s+4\d
2227
a variant of
2228
.CW ed
2229
done at the University of California at Berkeley,
2230
which allows one level of undoing, again takes 3 seconds.
2231
In summary,
2232
.CW sam 's
2233
performance is comparable to that of other UNIX editors, although it solves
2234
a harder problem.
2235
.SH 2
2236
Communications
2237
.LP
2238
The discussion so far has described the implementation of the host part of
2239
.CW sam ;
2240
the next few sections explain how a machine with mouse and bitmap display
2241
can be engaged to improve interaction.
2242
.CW Sam
2243
is not the first editor to be written as two processes,\u\s-4\&16\s+4\d
2244
but its implementation
2245
has some unusual aspects.
2246
.PP
2247
There are several ways
2248
.CW sam 's
2249
host and terminal parts may be connected.
2250
The first and simplest is to forgo the terminal part and use the host
2251
part's command language to edit text on an ordinary terminal.
2252
This mode is invoked by starting
2253
.CW sam
2254
with the
2255
.CW -d
2256
option.
2257
With no options,
2258
.CW sam
2259
runs separate host and terminal programs,
2260
communicating with a message protocol over the physical
2261
connection that joins them.
2262
Typically, the connection is an RS-232 link between a Blit
2263
(the prototypical display for
2264
.CW sam )
2265
and a host running
2266
the Ninth Edition of the UNIX operating system.\u\s-4\&8\s+4\d
2267
(This is the version of the system used in the Computing Sciences Research
2268
Center at AT&T Bell Laboratories [now Lucent Technologies, Bell Labs], where I work.  Its relevant
2269
aspects are discussed in the Blit paper.\u\s-4\&1\s+4\d)
2270
The implementation of
2271
.CW sam
2272
for the SUN computer runs both processes on the same machine and
2273
connects them by a pipe.
2274
.PP
2275
The low bandwidth of an RS-232 link
2276
necessitated the split between
2277
the two programs.
2278
The division is a mixed blessing:
2279
a program in two parts is much harder to write and to debug
2280
than a self-contained one,
2281
but the split makes several unusual configurations possible.
2282
The terminal may be physically separated from the host, allowing the conveniences
2283
of a mouse and bitmap display to be taken home while leaving the files at work.
2284
It is also possible to run the host part on a remote machine:
2285
.P1
2286
sam -r host
2287
.P2
2288
connects to the terminal in the usual way, and then makes a call
2289
across the network to establish the host part of
2290
.CW sam
2291
on the named machine.
2292
Finally, it cross-connects the I/O to join the two parts.
2293
This allows
2294
.CW sam
2295
to be run on machines that do not support bitmap displays;
2296
for example,
2297
.CW sam
2298
is the editor of choice on our Cray X-MP/24.
2299
.CW Sam
2300
.CW -r
2301
involves
2302
.I three
2303
machines: the remote host, the terminal, and the local host.
2304
The local host's job is simple but vital: it passes the data
2305
between the remote host and terminal.
2306
.PP
2307
The host and terminal exchange messages asynchronously
2308
(rather than, say, as remote procedure calls) but there is no
2309
error detection or correction
2310
because, whatever the configuration, the connection is reliable.
2311
Because the terminal handles mundane interaction tasks such as
2312
popping up menus and interpreting the responses, the messages are about
2313
data, not actions.
2314
For example, the host knows nothing about what is displayed on the screen,
2315
and when the user types a character, the message sent to the host says
2316
``insert a one-byte string at location 123 in file 7,'' not ``a character
2317
was typed at the current position in the current file.''
2318
In other words, the messages look very much like the transaction records
2319
in the transcripts.
2320
.PP
2321
Either the host or terminal part of
2322
.CW sam
2323
may initiate a change to a file.
2324
The command language operates on the host, while typing and some
2325
mouse operations are executed directly in the terminal to optimize response.
2326
Changes initiated by the host program must be transmitted to the terminal,
2327
and
2328
vice versa.
2329
(A token is exchanged to determine which end is in control,
2330
which means that characters typed while a time-consuming command runs
2331
must be buffered and do not appear until the command is complete.)
2332
To maintain consistent information,
2333
the host and terminal track changes through a per-file
2334
data structure that records what portions of the file
2335
the terminal has received.
2336
The data structure, called a
2337
.CW Rasp
2338
(a weak pun: it's a file with holes)
2339
is held and updated by both the host and terminal.
2340
A
2341
.CW Rasp
2342
is a list of
2343
.CW Strings
2344
holding those parts of the file known to the terminal,
2345
separated by counts of the number of bytes in the interstices.
2346
Of course, the host doesn't keep a separate copy of the data (it only needs
2347
the lengths of the various pieces),
2348
but the structure is the same on both ends.
2349
.PP
2350
The
2351
.CW Rasp
2352
in the terminal doubles as a cache.
2353
Since the terminal keeps the text for portions of the file it has displayed,
2354
it need not request data from the host when revisiting old parts of the file
2355
or redrawing obscured windows, which speeds things up considerably
2356
over low-speed links.
2357
.PP
2358
It's trivial for the terminal to maintain its
2359
.CW Rasp ,
2360
because all changes made on the terminal apply to parts of the file
2361
already loaded there.
2362
Changes made by the host are compared against the
2363
.CW Rasp
2364
during the update sequence after each command.
2365
Small changes to pieces of the file loaded in the terminal
2366
are sent in their entirety.
2367
Larger changes, and changes that fall entirely in the holes,
2368
are transmitted as messages without literal data:
2369
only the lengths of the deleted and inserted strings are transmitted.
2370
When a command is completed, the terminal examines its visible
2371
windows to see if any holes in their
2372
.CW Rasps
2373
intersect the visible portion of the file.
2374
It then requests the missing data from the host,
2375
along with up to 512 bytes of surrounding data, to minimize
2376
the number of messages when visiting a new portion of the file.
2377
This technique provides a kind of two-level lazy evaluation for the terminal.
2378
The first level sends a minimum of information about
2379
parts of the file not being edited interactively;
2380
the second level waits until a change is displayed before
2381
transmitting the new data.
2382
Of course,
2383
performance is also helped by having the terminal respond immediately to typing
2384
and simple mouse requests.
2385
Except for small changes to active pieces of the file, which are
2386
transmitted to the terminal without negotiation,
2387
the terminal is wholly responsible for deciding what is displayed;
2388
the host uses the
2389
.CW Rasp
2390
only to tell the terminal what might be relevant.
2391
.PP
2392
When a change is initiated by the host,
2393
the messages to the terminal describing the change
2394
are generated by the routine that applies the transcript of the changes
2395
to the contents of the
2396
.CW File .
2397
Since changes are undone by the same update routine,
2398
undoing requires
2399
no extra code in the communications;
2400
the usual messages describing changes to the file are sufficient
2401
to back up the screen image.
2402
.PP
2403
The
2404
.CW Rasp
2405
is a particularly good example of the way caches are used in
2406
.CW sam .
2407
First, it facilitates access to the active portion of the text by placing
2408
the busy text in main memory.
2409
In so doing, it provides efficient access
2410
to a large data structure that does not fit in memory.
2411
Since the form of data is to be imposed by the user, not by the program,
2412
and because characters will frequently be scanned sequentially,
2413
files are stored as flat objects.
2414
Caches help keep performance good and linear when working with such
2415
data.
2416
.PP
2417
Second, the
2418
.CW Rasp
2419
and several of the other caches have some
2420
.I read-ahead;
2421
that is, the cache is loaded with more information than is needed for
2422
the job immediately at hand.
2423
When manipulating linear structures, the accesses are usually sequential,
2424
and read-ahead can significantly reduce the average time to access the
2425
next element of the object.
2426
Sequential access is a common mode for people as well as programs;
2427
consider scrolling through a document while looking for something.
2428
.PP
2429
Finally, like any good data structure,
2430
the cache guides the algorithm, or at least the implementation.
2431
The
2432
.CW Rasp
2433
was actually invented to control the communications between the host and
2434
terminal parts, but I realized very early that it was also a form of
2435
cache.  Other caches were more explicitly intended to serve a double
2436
purpose: for example, the caches in
2437
.CW Files
2438
that coalesce updates not only reduce traffic to the
2439
transcript and contents
2440
.CW Buffers ,
2441
they also clump screen updates so that complicated changes to the
2442
screen are achieved in
2443
just a few messages to the terminal.
2444
This saved me considerable work: I did not need to write special
2445
code to optimize the message traffic to the
2446
terminal.
2447
Caches pay off in surprising ways.
2448
Also, they tend to be independent, so their performance improvements
2449
are multiplicative.
2450
.SH 2
2451
Data structures in the terminal
2452
.LP
2453
The terminal's job is to display and to maintain a consistent image of
2454
pieces of the files being edited.
2455
Because the text is always in memory, the data structures are
2456
considerably simpler than those in the host part.
2457
.PP
2458
.CW Sam
2459
typically has far more windows than does
2460
.CW mux ,
2461
the window system within which its Blit implementation runs.
2462
.CW Mux
2463
has a fairly small number of asynchronously updated windows;
2464
.CW sam
2465
needs a large number of synchronously updated windows that are
2466
usually static and often fully obscured.
2467
The different tradeoffs guided
2468
.CW sam
2469
away from the memory-intensive implementation of windows, called
2470
.CW Layers ,\u\s-4\&17\s+4\d
2471
used in
2472
.CW mux.
2473
Rather than depending on a complete bitmap image of the display for each window,
2474
.CW sam
2475
regenerates the image from its in-memory text
2476
(stored in the
2477
.CW Rasp )
2478
when necessary, although it will use such an image if it is available.
2479
Like
2480
.CW Layers ,
2481
though,
2482
.CW sam
2483
uses the screen bitmap as active storage in which to update the image using
2484
.CW bitblt .\u\s-4\&18,19\s+4\d
2485
The resulting organization, pictured in Figure 6,
2486
has a global array of windows, called
2487
.CW Flayers ,
2488
each of which holds an image of a piece of text held in a data structure
2489
called a
2490
.CW Frame ,
2491
which in turn represents
2492
a rectangular window full of text displayed in some
2493
.CW Bitmap .
2494
Each
2495
.CW Flayer
2496
appears in a global list that orders them all front-to-back
2497
on the display, and simultaneously as an element of a per-file array
2498
that holds all the open windows for that file.
2499
The complement in the terminal of the
2500
.CW File
2501
on the host is called a
2502
.CW Text ;
2503
each connects its
2504
.CW Flayers
2505
to the associated
2506
.CW Rasp .
2507
.KF
2508
.PS
2509
copy "fig6.pic"
2510
.PE
2511
.Cs
2512
Figure 6. Data structures in the terminal.
2513
.CW Flayers
2514
are also linked together into a front-to-back list.
2515
.CW Boxes
2516
are discussed in the next section.
2517
.Ce
2518
.KE
2519
.PP
2520
The
2521
.CW Bitmap
2522
for a
2523
.CW Frame
2524
contains the image of the text.
2525
For a fully visible window, the
2526
.CW Bitmap
2527
will be the screen (or at least the
2528
.CW Layer
2529
in which
2530
.CW sam
2531
is being run),
2532
while for partially obscured windows the
2533
.CW Bitmap
2534
will be off-screen.
2535
If the window is fully obscured, the
2536
.CW Bitmap
2537
will be null.
2538
.PP
2539
The
2540
.CW Bitmap
2541
is a kind of cache.
2542
When making changes to the display, most of the original image will
2543
look the same in the final image, and the update algorithms exploit this.
2544
The
2545
.CW Frame
2546
software updates the image in the
2547
.CW Bitmap
2548
incrementally; the
2549
.CW Bitmap
2550
is not just an image, it is a data structure.\u\s-4\&18,19\s+4\d
2551
The job of the software that updates the display is therefore
2552
to use as much as possible of the existing image (converting the
2553
text from ASCII characters to pixels is expensive) in a sort of two-dimensional
2554
string insertion algorithm.
2555
The details of this process are described in the next section.
2556
.PP
2557
The
2558
.CW Frame
2559
software has no code to support overlapping windows;
2560
its job is to keep a single
2561
.CW Bitmap
2562
up to date.
2563
It falls to the
2564
.CW Flayer
2565
software to multiplex the various
2566
.CW Bitmaps
2567
onto the screen.
2568
The problem of maintaining overlapping
2569
.CW Flayers
2570
is easier than for
2571
.CW Layers \u\s-4\&17\s+4\d
2572
because changes are made synchronously and because the contents of the window
2573
can be reconstructed from the data stored in the
2574
.CW Frame ;
2575
the
2576
.CW Layers
2577
software
2578
makes no such assumptions.
2579
In
2580
.CW sam ,
2581
the window being changed is almost always fully visible, because the current
2582
window is always fully visible, by construction.
2583
However, when multi-file changes are being made, or when
2584
more than one window is open on a file,
2585
it may be necessary to update partially obscured windows.
2586
.PP
2587
There are three cases: the window is 
2588
fully visible, invisible (fully obscured), or partially visible.
2589
If fully visible, the
2590
.CW Bitmap
2591
is part of the screen, so when the
2592
.CW Flayer
2593
update routine calls the
2594
.CW Frame
2595
update routine, the screen will be updated directly.
2596
If the window is invisible,
2597
there is no associated
2598
.CW Bitmap ,
2599
and all that is necessary is to update the
2600
.CW Frame
2601
data structure, not the image.
2602
If the window is partially visible, the
2603
.CW Frame
2604
routine is called to update the image in the off-screen
2605
.CW Bitmap ,
2606
which may require regenerating it from the text of the window.
2607
The
2608
.CW Flayer
2609
code then clips this
2610
.CW Bitmap
2611
against the
2612
.CW Bitmaps
2613
of all
2614
.CW Frames
2615
in front of the
2616
.CW Frame
2617
being modified, and the remainder is copied to the display.
2618
.PP
2619
This is much faster than recreating the image off-screen
2620
for every change, or clipping all the changes made to the image
2621
during its update.
2622
Unfortunately, these caches can also consume prohibitive amounts of
2623
memory, so they are freed fairly liberally \(em after every change to the
2624
front-to-back order of the
2625
.CW Flayers .
2626
The result is that
2627
the off-screen
2628
.CW Bitmaps
2629
exist only while multi-window changes are occurring,
2630
which is the only time the performance improvement they provide is needed.
2631
Also, the user interface causes fully-obscured windows to be the
2632
easiest to make \(em
2633
creating a canonically sized and placed window requires only a button click
2634
\(em which reduces the need for caching still further.
2635
.PP
2636
.SH 2
2637
Screen update
2638
.LP
2639
Only two low-level primitives are needed for incremental update:
2640
.CW bitblt ,
2641
which copies rectangles of pixels, and
2642
.CW string
2643
(which in turn calls
2644
.CW bitblt ),
2645
which draws a null-terminated character string in a
2646
.CW Bitmap .
2647
A
2648
.CW Frame
2649
contains a list of
2650
.CW Boxes ,
2651
each of which defines a horizontal strip of text in the window
2652
(see Figure 7).
2653
A
2654
.CW Box
2655
has a character string
2656
.CW str ,
2657
and a
2658
.CW Rectangle
2659
.CW rect
2660
that defines the location of the strip in the window.
2661
(The text in
2662
.CW str
2663
is stored in the
2664
.CW Box
2665
separately from the
2666
.CW Rasp
2667
associated with the window's file, so
2668
.CW Boxes
2669
are self-contained.)
2670
The invariant is that
2671
the image of the
2672
.CW Box
2673
can be reproduced by calling
2674
.CW string
2675
with argument
2676
.CW str
2677
to draw the string in
2678
.CW rect ,
2679
and the resulting picture fits perfectly within
2680
.CW rect .
2681
In other words, the
2682
.CW Boxes
2683
define the tiling of the window.
2684
The tiling may be complicated by long lines of text, which
2685
are folded onto the next line.
2686
Some editors use horizontal scrolling to avoid this complication,
2687
but to be comfortable this technique requires that lines not be
2688
.I too
2689
long;
2690
.CW sam
2691
has no such restriction.
2692
Also, and perhaps more importantly, UNIX programs and terminals traditionally fold
2693
long lines to make their contents fully visible.
2694
.PP
2695
Two special kinds of
2696
.CW Boxes
2697
contain a single
2698
character: either a newline or a tab.
2699
Newlines and tabs are white space.
2700
A newline
2701
.CW Box
2702
always extends to the right edge of the window,
2703
forcing the following
2704
.CW Box
2705
to the next line.
2706
The width of a tab depends on where it is located:
2707
it forces the next
2708
.CW Box
2709
to begin at a tab location.
2710
Tabs also
2711
have a minimum width equivalent to a blank (blanks are
2712
drawn by
2713
.CW string
2714
and are not treated specially); newlines have a minimum width of zero.
2715
.KF
2716
.PS
2717
copy "fig7.pic"
2718
.PE
2719
.sp .5
2720
.Cs
2721
Figure 7. A line of text showing its
2722
.CW Boxes .
2723
The first two blank
2724
.CW Boxes
2725
contain tabs; the last contains a newline.
2726
Spaces are handled as ordinary characters.
2727
.Ce
2728
.KE
2729
.PP
2730
The update algorithms always use the
2731
.CW Bitmap
2732
image of the text (either the display or cache
2733
.CW Bitmap );
2734
they never examine the characters within a
2735
.CW Box
2736
except when the
2737
.CW Box
2738
needs to be split in two.
2739
Before a change, the window consists of a tiling of
2740
.CW Boxes ;
2741
after the change the window is tiled differently.
2742
The update algorithms rearrange the tiles in place, without
2743
backup storage.
2744
The algorithms are not strictly optimal \(em for example, they can
2745
clear a pixel that is later going to be written upon \(em
2746
but they never move a tile that doesn't need to be moved,
2747
and they move each tile at most once.
2748
.CW Frinsert
2749
on a Blit can absorb over a thousand characters a second if the strings
2750
being inserted are a few tens of characters long.
2751
.PP
2752
Consider
2753
.CW frdelete .
2754
Its job is to delete a substring from a
2755
.CW Frame
2756
and restore the image of the
2757
.CW Frame .
2758
The image of a substring has a peculiar shape (see Figure 2) comprising
2759
possibly a partial line,
2760
zero or more full lines,
2761
and possibly a final partial line.
2762
For reference, call this the
2763
.I
2764
Z-shape.
2765
.R
2766
.CW Frdelete
2767
begins by splitting, if necessary, the
2768
.CW Boxes
2769
containing the ends of
2770
the substring so the substring begins and ends on
2771
.CW Box
2772
boundaries.
2773
Because the substring is being deleted, its image is not needed,
2774
so the Z-shape is then cleared.
2775
Then, tiles (that is, the images of
2776
.CW Boxes )
2777
are copied, using
2778
.CW bitblt ,
2779
from immediately after the Z-shape to
2780
the beginning of the Z-shape,
2781
resulting in a new Z-shape.
2782
.CW Boxes "" (
2783
whose contents would span two lines in the new position must first be split.)
2784
.PP
2785
Copying the remainder of the
2786
.CW Frame
2787
tile by tile
2788
this way will clearly accomplish the deletion but eventually,
2789
typically when the copying algorithm encounters a tab or newline,
2790
the old and new
2791
.CW x
2792
coordinates of the tile
2793
to be copied are the same.
2794
This correspondence implies
2795
that the Z-shape has its beginning and ending edges aligned
2796
vertically, and a sequence of at most two
2797
.CW bitblts
2798
can be used to copy the remaining tiles.
2799
The last step is to clear out the resulting empty space at the bottom
2800
of the window;
2801
the number of lines to be cleared is the number of complete lines in the
2802
Z-shape closed by the final
2803
.CW bitblts.
2804
The final step is to merge horizontally adjacent
2805
.CW Boxes
2806
of plain text.
2807
The complete source to
2808
.CW frdelete
2809
is less than 100 lines of C.
2810
.PP
2811
.CW frinsert
2812
is more complicated because it must do four passes:
2813
one to construct the
2814
.CW Box
2815
list for the inserted string,
2816
one to reconnoitre,
2817
one to copy (in opposite order to
2818
.CW frdelete )
2819
the
2820
.CW Boxes
2821
to make the hole for the new text,
2822
and finally one to copy the new text into place.
2823
Overall, though,
2824
.CW frinsert
2825
has a similar flavor to
2826
.CW frdelete ,
2827
and needn't be described further.
2828
.CW Frinsert
2829
and its subsidiary routines comprise 211 lines of C.
2830
.PP
2831
The terminal source code is 3024 lines of C,
2832
and the host source is 5797 lines.
2833
.SH
2834
Discussion
2835
.SH 2
2836
History
2837
.LP
2838
The immediate ancestor of
2839
.CW sam
2840
was the original text editor for the Blit, called
2841
.CW jim .
2842
.CW Sam
2843
inherited
2844
.CW jim 's
2845
two-process structure and mouse language almost unchanged, but
2846
.CW jim
2847
suffered from several drawbacks that were addressed in the design of
2848
.CW sam .
2849
The most important of these was the lack of a command language.
2850
Although
2851
.CW jim
2852
was easy to use for simple editing, it provided no direct help with
2853
large or repetitive editing tasks.  Instead, it provided a command to pass
2854
selected text through a shell pipeline,
2855
but this was no more satisfactory than could be expected of a stopgap measure.
2856
.PP
2857
.CW Jim
2858
was written primarily as a vehicle for experimenting with a mouse-based
2859
interface to text, and the experiment was successful.
2860
.CW Jim
2861
had some spin-offs:
2862
.CW mux ,
2863
the second window system for the Blit, is essentially a multiplexed
2864
version of the terminal part of
2865
.CW jim ;
2866
and the debugger
2867
.CW pi 's
2868
user interface\u\s-4\&20\s+4\d was closely modeled on
2869
.CW jim 's.
2870
But after a couple of years,
2871
.CW jim
2872
had become difficult to maintain and limiting to use,
2873
and its replacement was overdue.
2874
.PP
2875
I began the design of
2876
.CW sam
2877
by asking
2878
.CW jim
2879
customers what they wanted.
2880
This was probably a mistake; the answers were essentially a list of features
2881
to be found in other editors, which did not provide any of the
2882
guiding principles I was seeking.
2883
For instance, one common request was for a ``global substitute,''
2884
but no one suggested how to provide it within a cut-and-paste editor.
2885
I was looking for a scheme that would
2886
support such specialized features comfortably in the context of some
2887
general command language.
2888
Ideas were not forthcoming, though, particularly given my insistence
2889
on removing all limits on file sizes, line lengths and so on.
2890
Even worse, I recognized that, since the mouse could easily
2891
indicate a region of the screen that was not an integral number of lines,
2892
the command language would best forget about newlines altogether,
2893
and that meant the command language had to treat the file as a single
2894
string, not an array of lines.
2895
.PP
2896
Eventually, I decided that thinking was not getting me very far and it was
2897
time to try building.
2898
I knew that the terminal part could be built easily \(em
2899
that part of
2900
.CW jim
2901
behaved acceptably well \(em and that most of the hard work was going
2902
to be in the host part: the file interface, command interpreter and so on.
2903
Moreover, I had some ideas about how the architecture of
2904
.CW jim
2905
could be improved without destroying its basic structure, which I liked
2906
in principle but which hadn't worked out as well as I had hoped.
2907
So I began by designing the file data structure,
2908
starting with the way
2909
.CW jim
2910
worked \(em comparable to a single structure merging
2911
.CW Disc
2912
and
2913
.CW Buffer ,
2914
which I split to make the cache more general
2915
\(em and thinking about how global substitute could be implemented.
2916
The answer was clearly that it had to be done in two passes,
2917
and the transcript-oriented implementation fell out naturally.
2918
.PP
2919
.CW Sam
2920
was written bottom-up,
2921
starting from the data structures and algorithms for manipulating text,
2922
through the command language and up to the code for maintaining
2923
the display.
2924
In retrospect, it turned out well, but this implementation method is
2925
not recommended in general.
2926
There were several times when I had a large body of interesting code
2927
assembled and no clue how to proceed with it.
2928
The command language, in particular, took almost a year to figure out,
2929
but can be implemented (given what was there at the beginning of that year)
2930
in a day or two.  Similarly, inventing the
2931
.CW Rasp
2932
data structure delayed the
2933
connection of the host and terminal pieces by another few months.
2934
.CW Sam
2935
took about two years to write, although only about four months were
2936
spent actually working on it.
2937
.PP
2938
Part of the design process was unusual:
2939
the subset of the protocol that maintains the
2940
.CW Rasp
2941
was simulated, debugged
2942
and verified by an automatic protocol analyzer,\u\s-4\&21\s+4\d and was bug-free
2943
from the start.
2944
The rest of the protocol, concerned mostly
2945
with keeping menus up to date,
2946
was unfortunately too unwieldy for such analysis,
2947
and was debugged by more traditional methods, primarily
2948
by logging in a file all messages in and out of the host.
2949
.SH 2
2950
Reflections
2951
.LP
2952
.CW Sam
2953
is essentially the only interactive editor used by the sixty or so members of
2954
the computing science research center in which I work.
2955
The same could not be said of
2956
.CW jim ;
2957
the lack of a command language kept some people from adopting it.
2958
The union of a user interface as comfortable as
2959
.CW jim 's
2960
with a command language as powerful as
2961
.CW ed 's†
2962
.FS
2963
.vs 9
2964
†The people who criticize
2965
.CW ed
2966
as an interactive program often forget that it and its close relative
2967
.CW sed \u\s-4\&7\s+4\d
2968
still thrive as programmable editors.  The strength of these programs is
2969
independent of their convenience for interactive editing.
2970
.br
2971
.vs
2972
.FE
2973
is essential to
2974
.CW sam 's
2975
success.
2976
When
2977
.CW sam
2978
was first made available to the
2979
.CW jim
2980
community,
2981
almost everyone switched to it within two or three days.
2982
In the months that followed, even people who had never adopted
2983
.CW jim
2984
started using
2985
.CW sam
2986
exclusively.
2987
.PP
2988
To be honest,
2989
.CW ed
2990
still gets occasional use, but usually when
2991
something quick needs to be done and the overhead of
2992
downloading the terminal part of
2993
.CW sam
2994
isn't worth the trouble.
2995
Also, as a `line' editor,
2996
.CW sam
2997
.CW -d
2998
is a bit odd;
2999
when using a good old ASCII terminal, it's comforting to have
3000
a true line editor.
3001
But it is fair to say that
3002
.CW sam 's
3003
command language has displaced
3004
.CW ed 's
3005
for most of the complicated editing that has kept line editors
3006
(that is, command-driven editors) with us.
3007
.PP
3008
.CW Sam 's
3009
command language is even fancier than
3010
.CW ed 's,
3011
and most
3012
.CW sam
3013
customers don't come near to using all its capabilities.
3014
Does it need to be so sophisticated?
3015
I think the answer is yes, for two reasons.
3016
.PP
3017
First, the
3018
.I model
3019
for
3020
.CW sam 's
3021
command language is really relatively simple, and certainly simpler than that of
3022
.CW ed .
3023
For instance, there is only one kind of textual loop in
3024
.CW sam
3025
\(em the
3026
.CW x
3027
command \(em
3028
while
3029
.CW ed
3030
has three (the
3031
.CW g
3032
command, the global flag on substitutions, and the implicit loop over
3033
lines in multi-line substitutions).
3034
Also,
3035
.CW ed 's
3036
substitute command is necessary to make changes within lines, but in
3037
.CW sam
3038
the
3039
.CW s
3040
command is more of a familiar convenience than a necessity;
3041
.CW c
3042
and
3043
.CW t
3044
can do all the work.
3045
.PP
3046
Second,
3047
given a community that expects an editor to be about as powerful as
3048
.CW ed ,
3049
it's hard to see how
3050
.CW sam
3051
could really be much simpler and still satisfy that expectation.
3052
People want to do ``global substitutes,'' and most are content
3053
to have the recipe for that and a few other fancy changes.
3054
The sophistication of the command language is really just a veneer
3055
over a design that makes it possible to do global substitutes
3056
in a screen editor.
3057
Some people will always want something more, however, and it's gratifying to
3058
be able to provide it.
3059
The real power of
3060
.CW sam 's
3061
command language comes from composability of the operators, which is by
3062
nature orthogonal to the underlying model.
3063
In other words,
3064
.CW sam
3065
is not itself complex, but it makes complex things possible.
3066
If you don't want to do anything complex, you can ignore the
3067
complexity altogether, and many people do so.
3068
.PP
3069
Sometimes I am asked the opposite question: why didn't I just make
3070
.CW sam
3071
a real programmable editor, with macros and variables and so on?
3072
The main reason is a matter of taste: I like the editor
3073
to be the same every time I use it.
3074
There is one technical reason, though:
3075
programmability in editors is largely a workaround for insufficient
3076
interactivity.
3077
Programmable editors are used to make particular, usually short-term,
3078
things easy to do, such as by providing shorthands for common actions.
3079
If things are generally easy to do in the first place,
3080
shorthands are not as helpful.
3081
.CW Sam
3082
makes common editing operations very easy, and the solutions to
3083
complex editing problems seem commensurate with the problems themselves.
3084
Also, the ability to edit the
3085
.CW sam
3086
window makes it easy to repeat commands \(em it only takes a mouse button click
3087
to execute a command again.
3088
.SH 2
3089
Pros and cons
3090
.LP
3091
.CW Sam
3092
has several other good points,
3093
and its share of problems.
3094
Among the good things is the idea of
3095
structural regular expressions,
3096
whose usefulness has only begun to be explored.
3097
They were arrived at serendipitously when I attempted to distill the essence of
3098
.CW ed 's
3099
way of doing global substitution and recognized that the looping command in
3100
.CW ed
3101
was implicitly imposing a structure (an array of lines) on the file.
3102
.PP
3103
Another of
3104
.CW sam 's
3105
good things is its undo capability.
3106
I had never before used an editor with a true undo,
3107
but I would never go back now.
3108
Undo
3109
.I must
3110
be done well, but if it is, it can be relied on.
3111
For example,
3112
it's safe to experiment if you're not sure how to write some intricate command,
3113
because if you make a mistake, it can be fixed simply and reliably.
3114
I learned two things about undo from writing
3115
.CW sam :
3116
first, it's easy to provide if you design it in from the beginning, and
3117
second, it's necessary, particularly if the system has some subtle
3118
properties that may be unfamiliar or error-prone for users.
3119
.PP
3120
.CW Sam 's
3121
lack of internal limits and sizes is a virtue.
3122
Because it avoids all fixed-size tables and data structures,
3123
.CW sam
3124
is able to make global changes to files that some of our other
3125
tools cannot even read.
3126
Moreover, the design keeps the performance linear when doing such
3127
operations, although I must admit
3128
.CW sam
3129
does get slow when editing a huge file.
3130
.PP
3131
Now, the problems.
3132
Externally, the most obvious is that it is poorly integrated into the
3133
surrounding window system.
3134
By design, the user interface in
3135
.CW sam
3136
feels almost identical to that of
3137
.CW mux ,
3138
but a thick wall separates text in
3139
.CW sam
3140
from the programs running in
3141
.CW mux .
3142
For instance, the `snarf buffer' in
3143
.CW sam
3144
must be maintained separately from that in
3145
.CW mux .
3146
This is regrettable, but probably necessary given the unusual configuration
3147
of the system, with a programmable terminal on the far end of an RS-232 link.
3148
.PP
3149
.CW Sam
3150
is reliable; otherwise, people wouldn't use it.
3151
But it was written over such a long time, and has so many new (to me)
3152
ideas in it, that I would like to see it done over again to clean
3153
up the code and remove many of the lingering problems in the implementation.
3154
The worst part is in the interconnection of the host and terminal parts,
3155
which might even be able to go away in a redesign for a more
3156
conventional window system.
3157
The program must be split in two to use the terminal effectively,
3158
but the low bandwidth of the connection forces the separation to
3159
occur in an inconvenient part of the design if performance is to be acceptable.
3160
A simple remote procedure call
3161
protocol driven by the host, emitting only graphics
3162
commands, would be easy to write but wouldn't have nearly the
3163
necessary responsiveness.  On the other hand, if the terminal were in control
3164
and requested much simpler file services from the host, regular expression
3165
searches would require that the terminal read the entire file over its RS-232
3166
link, which would be unreasonably slow.
3167
A compromise in which either end can take control is necessary.
3168
In retrospect, the communications protocol should have been
3169
designed and verified formally, although I do not know of any tool
3170
that can adequately relate the protocol to
3171
its implementation.
3172
.PP
3173
Not all of
3174
.CW sam 's
3175
users are comfortable with its command language, and few are adept.
3176
Some (venerable) people use a sort of
3177
.CW ed \& ``
3178
subset'' of
3179
.CW sam 's
3180
command language,
3181
and even ask why
3182
.CW sam 's
3183
command language is not exactly
3184
.CW ed 's.
3185
(The reason, of course, is that
3186
.CW sam 's
3187
model for text does not include newlines, which are central to
3188
.CW ed .
3189
Making the text an array of newlines to the command language would
3190
be too much of a break from the seamless model provided by the mouse.
3191
Some editors, such as
3192
.CW vi ,
3193
are willing to make this break, though.)
3194
The difficulty is that
3195
.CW sam 's
3196
syntax is so close to
3197
.CW ed 's
3198
that people believe it
3199
.I should
3200
be the same.
3201
I thought, with some justification in hindsight,
3202
that making
3203
.CW sam
3204
similar to
3205
.CW ed
3206
would make it easier to learn and to accept.
3207
But I may have overstepped and raised the users'
3208
expectations too much.
3209
It's hard to decide which way to resolve this problem.
3210
.PP
3211
Finally, there is a tradeoff in
3212
.CW sam
3213
that was decided by the environment in which it runs:
3214
.CW sam
3215
is a multi-file editor, although in a different system there might instead be
3216
multiple single-file editors.
3217
The decision was made primarily because starting a new program in a Blit is
3218
time-consuming.
3219
If the choice could be made freely, however, I would
3220
still choose the multi-file architecture, because it allows
3221
groups of files to be handled as a unit;
3222
the usefulness of the multi-file commands is incontrovertible.
3223
It is delightful to have the source to an entire program
3224
available at your fingertips.
3225
.SH
3226
Acknowledgements
3227
.LP
3228
Tom Cargill suggested the idea behind the
3229
.CW Rasp
3230
data structure.
3231
Norman Wilson and Ken Thompson influenced the command language.
3232
This paper was improved by comments from
3233
Al Aho,
3234
Jon Bentley,
3235
Chris Fraser,
3236
Gerard Holzmann,
3237
Brian Kernighan,
3238
Ted Kowalski,
3239
Doug McIlroy
3240
and
3241
Dennis Ritchie.