
Thread: How To: Make Your Own Ubuntu Repository DVDs


    Having Fun with the "debmirror" Dry Run—and Learning Something New Along the Way.

    The Motivation for this Post.

    In an earlier post in this thread, NorthernSuze asked a couple of interesting questions:
    1. Is there a way of telling from the dry run, how big the initial update will be?
    2. Is there a way of limiting the initial “debmirror” run to match what is remaining on our bandwidth at the end of the month?

    Following are the immediate answers to these questions:
    • Just like the real run, the dry run will output a line of text that shows the total size of the download—e.g.:
      Code:
      Download all files that we need to get (28061 MiB).
      However, the output from a dry run will scroll by so quickly that it’s all too easy to miss this line.

    • To limit a “debmirror” run, you can specify the “--max-batch=number” parameter, where “number” represents the maximum number of files that you wish to download. This parameter does not, however, take into account the actual sizes of the files. Consequently, if “debmirror” happens to select a few huge files, you may end up with a multi-gigabyte download of, say, just five or six files; conversely, if “debmirror” chooses to download a set of small files, your download may consist of perhaps fifty or sixty files for a total of just a few hundred kilobytes.
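
    For example, you could combine the dry run with “--max-batch” as follows (a sketch only; the value 200 is arbitrary, and the command still needs network access to contact the archive):
```shell
# Dry run, but consider at most 200 files for download.
debmirror --dry-run --progress --max-batch=200 \
   --method=http --host=archive.ubuntu.com --root=ubuntu \
   --dist=karmic,karmic-security,karmic-updates \
   --section=main,multiverse,restricted,universe \
   --arch=none \
   ~/UbuntuSources
```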

    Obviously, while these answers may be helpful to some extent, they won’t necessarily be perceived as a complete solution—which is why I decided to take a closer look at these issues. The results of my research can be found in this post.

    Preparatory Steps.

    If you want to follow along with the experiments described in this post, then I’m assuming that you have already correctly set up the “debmirror” program, as documented earlier in this thread.

    You will also need to create a directory that will play the role of a new local mirror for the Ubuntu source package archive. Thus, you should create a directory named, e.g., “UbuntuSources” in your home directory—as follows:
    Code:
    cd
    mkdir UbuntuSources
    Note:

    You won’t have to actually download the complete software archive into this new directory; instead, it will simply be used for experimentation, possibly in preparation for a full download.

    Note:

    For these experiments, I opted for a repository of source (as opposed to binary) packages, because that turned out to be a little more instructive than a binary repository. You can adapt the experiments to make them work on a repository of binary packages instead, but that, as they say, is “left as an exercise for the reader.” In fact, you could subsequently make the experiments work even on a combined (binary plus source) repository—which you can consider a further (and somewhat more advanced) “exercise for the reader.”

    After you create the appropriate directory, you should use it as the target location on a “debmirror” dry run—e.g.:
    Code:
    debmirror --dry-run --progress \
       --method=http --host=archive.ubuntu.com --root=ubuntu \
       --dist=karmic,karmic-security,karmic-updates \
       --section=main,multiverse,restricted,universe \
       --arch=none \
       ~/UbuntuSources
    Note:

    Many of the shell commands in this post will be rather long, and will be split into multiple lines, using the backslash (i.e., “\”) as the line continuation character. When you enter such commands into a shell, you should either type them on one long line (but without the backslashes), or use the backslash as the last character on a line to inform the shell that the command is not yet complete, and that you will continue it on the next line.
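
    As a minimal, self-contained illustration of the continuation character:
```shell
# The backslash at the end of the first line tells the shell
# that the command continues on the next line.
echo This command \
     continues here
```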

    You are now ready to start the experiments—the first of which will be about finding out how big a full download would be.

    Experiment 1: How Big Would a Full Download Be?

    If you performed the dry run as described above, then you will have seen a great number of lines scroll by so fast that you will surely have missed the line that informed you about the total size of the full download—e.g.:
    Code:
    Download all files that we need to get (28061 MiB).
    Even if you try to scroll back through the output, you will most likely no longer find this line, since it will already have been purged from the buffer maintained by the terminal session, due to the huge amount of text that the command produced.

    Part 1: Capturing the Output into a File.

    The “debmirror” command sends its output to its “Standard Output Stream” (commonly abbreviated as “stdout”), which is connected to the terminal window by default. The classic way to capture “stdout” into a file is called “Output Redirection,” which you activate by means of a “>” sign followed by the target file name—e.g., you can send the output from the “debmirror” dry run to a file named “experiment-1.1” as follows:
    Code:
    debmirror --dry-run --progress \
       --method=http --host=archive.ubuntu.com --root=ubuntu \
       --dist=karmic,karmic-security,karmic-updates \
       --section=main,multiverse,restricted,universe \
       --arch=none \
       ~/UbuntuSources > experiment-1.1
    You won’t see any output on-screen, because the output will be redirected to the specified file instead.

    After this command completes, you can open the file, “experiment-1.1,” with any text editor, and search for the text “Download all files that we need to get” to find the total download size.

    By the way:

    In addition to the “Standard Output Stream” (i.e., “stdout”), there also exists a “Standard Error Stream” (i.e., “stderr”), to which programs are expected to output their error messages (though some programs may choose not to adhere to this convention). The “>” redirection operator works only on “stdout,” and will not redirect error messages—which will continue to appear on-screen.

    If you want to redirect the “stderr” stream, then you should use the “2>” operator, like so:
    Code:
    command-line 2> stderr-destination
    In fact, the “>” notation, to redirect “stdout,” is a shorthand for “1>”—so, the following are equivalent:
    Code:
    command-line > stdout-destination
    and:
    Code:
    command-line 1> stdout-destination
    To redirect both “stdout” and “stderr,” you can, then, use the following construct:
    Code:
    command-line 1> stdout-destination 2> stderr-destination
    Keep in mind, though, that “stdout-destination” and “stderr-destination” should not refer to the same file: the shell would open that file twice, and the two streams would then overwrite each other’s output.

    To capture both “stdout” and “stderr” into a single file, you will have to use the following:
    Code:
    command-line 1> stdout-destination 2>&1
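    As a small, self-contained demonstration (it uses nothing beyond standard shell features, and writes to a scratch file named “both.txt”):
```shell
# Emit one line on stdout and one on stderr, then capture
# both streams into the single file "both.txt".
{ echo "sent to stdout"; echo "sent to stderr" >&2; } 1> both.txt 2>&1
cat both.txt
```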
    Part 2: Sending the Output into a “T Fitting.”

    Instead of simply redirecting the output to a file, you may “pipe” it through the “tee” command.

    Command “piping” is the act of connecting two (or more) commands together with a “pipe” symbol—i.e., the vertical bar, “|”—which will feed the “Standard Output Stream” from the first command, as input to the second one.

    The “tee” command will read its input (i.e., its “Standard Input Stream,” or “stdin”), and create two identical copies of it:
    • One copy will go to a file, the name of which should be specified as an argument to the “tee” command;
    • Another copy will simply go to the “Standard Output Stream” of the “tee” command—i.e., the terminal window by default.

    The command name, “tee,” is a play on words, derived from the world of plumbing, and refers to the conceptual similarity with a “T Fitting.”
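
    Before applying this to “debmirror,” here is a tiny self-contained illustration (the file name “copy.txt” is arbitrary):
```shell
# tee copies its input both to the file "copy.txt" and to stdout,
# so the two lines appear on-screen and in the file.
printf 'line one\nline two\n' | tee copy.txt
```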

    As an example, you can see the “debmirror” output scroll by on-screen, while simultaneously copying it to a file named “experiment-1.2,” as follows:
    Code:
    debmirror --dry-run --progress \
       --method=http --host=archive.ubuntu.com --root=ubuntu \
       --dist=karmic,karmic-security,karmic-updates \
       --section=main,multiverse,restricted,universe \
       --arch=none \
       ~/UbuntuSources \
       | tee experiment-1.2
    Part 3: “Grepping” One Leg of the “T Fitting.”

    As described above, the “tee” command will copy its “stdin” stream to a file and, at the same time, to its standard output stream. Instead of letting the output scroll by on-screen, you can “pipe” it through yet another command, to post-process it—e.g., to display only the text lines that you are interested in, and suppress all the rest.

    The classic command to search for text strings in a file, is “grep”—which reads text from its standard input stream (or, if specified, from a set of input files), and outputs only those lines that match a given “regular expression.”

    A “regular expression” is a particular type of pattern, which describes string formats in an incredibly powerful (albeit awfully cumbersome) notation. A complete description of “regular expressions” can easily fill a complete book volume (and then some), but you can begin to use some of their simpler features without much training.

    In the case of the “debmirror” command, for example, you are likely not all too interested in seeing the complete output scroll by on-screen; instead, what you are really looking for, is the line that contains the text “Download all files that we need to get”—you probably won’t mind if the rest of the output gets suppressed.

    The “regular expression” to identify any text line that contains the string “Download all files that we need to get” is just the string itself—in a regular expression, each letter, digit, or space represents simply one occurrence of itself. One typical feature that you may want to use, though, is the “start-of-string anchor”: the caret—i.e., “^”—in a regular expression does not match any character, but represents the “start of the string” instead.
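
    A quick illustration of the anchor, using two made-up input lines:
```shell
# Only the first line starts with "Download"; the second merely
# contains the word mid-line, so the anchored pattern skips it.
printf 'Download complete.\nPlease Download later.\n' | grep '^Download'
```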

    Therefore, if you want to search for the text string “Download all files that we need to get” at the start of a line, you would specify the following regular expression:
    Code:
    ^Download all files that we need to get
    If you type a regular expression on a command line, you should generally enclose it in single quotes; otherwise, the shell will attempt to interpret many of the special characters that usually appear in the expression in unwanted ways—and, as a result, hopelessly mess up the regular expression before it even gets seen by the command that should process it.

    You can now send the “debmirror” output to a file named “experiment-1.3,” while simultaneously suppressing all on-screen output except for the one line that you are interested in, as follows:
    Code:
    debmirror --dry-run --progress \
       --method=http --host=archive.ubuntu.com --root=ubuntu \
       --dist=karmic,karmic-security,karmic-updates \
       --section=main,multiverse,restricted,universe \
       --arch=none \
       ~/UbuntuSources \
       | tee experiment-1.3 \
       | grep '^Download all files that we need to get'
    By the way:

    The program name “grep” can be seen as a tribute to the venerable line editor “ed,” which implements a g/re/p (i.e., “global regular expression print”) command sequence to print all lines that match a given regular expression. In fact, “ed” was one of the first programs ever to provide an implementation of regular expressions.

    Part 4: Eliminating the “T Fitting.”

    Chances are, that you don’t actually care for the complete output from the “debmirror” command, and that, therefore, there’s no need to have the “tee” command copy it into a file.

    In other words, you may prefer to simply skip the “tee” command, and pipe the output directly on to “grep”—as follows:
    Code:
    debmirror --dry-run --progress \
       --method=http --host=archive.ubuntu.com --root=ubuntu \
       --dist=karmic,karmic-security,karmic-updates \
       --section=main,multiverse,restricted,universe \
       --arch=none \
       ~/UbuntuSources \
       | grep '^Download all files that we need to get'
    Part 5: Less Is More (more or less).

    The “more” command is a potentially handy utility for paging through text, one screenful at a time. However, as its man page readily admits:
    Code:
    This version is especially primitive.  Users should realize that
    less provides more emulation and extensive enhancements.
    Because “more” is rather primitive, a more advanced alternative was developed—which, in a typical play on words, was dubbed “less”—if only to express the idea that “less is more.”

    Thus, if you want to page through text, you will most likely prefer the “less” utility over “more”; you could, for example, pipe the “debmirror” output on to the “less” command, as follows:
    Code:
    debmirror --dry-run --progress \
       --method=http --host=archive.ubuntu.com --root=ubuntu \
       --dist=karmic,karmic-security,karmic-updates \
       --section=main,multiverse,restricted,universe \
       --arch=none \
       ~/UbuntuSources \
       | less
    You will see the first screenful of output, after which the “less” command waits for further instructions from you. The following are a few of the common commands that you can use to navigate through the text:
    • Hit the space bar to display the next screenful of text;
    • Hit the <ENTER> key to scroll forward one single line;
    • Type a lowercase letter “b” to scroll backwards one screenful of text;
    • Type a “<” sign to jump back to the start of the text;
    • Type a “>” sign to jump to the end of the text.

    Obviously, having to keep hitting the space bar until you reach the one line that you are looking for, wouldn’t be very practical; you will probably want to search for the line instead. The command to initiate a search, is the “/” (forward slash) character, which should be followed by the regular expression that you want to search for. As soon as you subsequently hit the <ENTER> key, the search will actually be performed.

    Therefore, if you want to search for the text string “Download all files that we need to get” at the start of a line, you can input the following command to the “less” utility:
    Code:
    /^Download all files that we need to get
    Finally, when you want to quit the “less” utility, simply enter a lowercase “q” command.

    Experiment 2: What Files Does a “debmirror” Dry Run Create?

    The “debmirror” dry run will have created a number of subdirectories and files in the “~/UbuntuSources” directory. You can list the complete contents of this directory with the following command:
    Code:
    ls --recursive ~/UbuntuSources
    In fact, you will be surprised at the number of objects that the “~/UbuntuSources” directory now contains!

    In this experiment, you will find out exactly which files were downloaded into the “~/UbuntuSources” directory tree by the “debmirror” dry run, without having to wade through the output from the “ls” command.

    Part 1: Finding all Files Created by the “debmirror” Dry Run.

    There exists an incredibly powerful and versatile command to search for file system objects (like files or directories) in a directory hierarchy: the “find” command. As with “regular expressions,” a complete description of the “find” command can easily fill a complete book volume (and then some)—but, again as with “regular expressions,” you can begin to use some of its simpler features without much training.

    The first argument that the “find” command needs, is the name of the directory where you want to start the search. For example, if you want to obtain a full list of files and directories that exist in your “~/UbuntuSources” directory tree, you can run the following command:
    Code:
    find ~/UbuntuSources
    Note that, even though you instructed the “find” command to search for file system objects in the “~/UbuntuSources” location, you did not specify what you wanted it to do with any of the objects found. Consequently, the command will simply take its default action—which is to display the path to each of the objects, one per line.

    By the way:

    If, at this point, you’re curious about the total number of files and directories in your “~/UbuntuSources” directory tree, then you can pipe the output from the “find” command on to the “wc” command, and make it count the number of lines—like so:
    Code:
    find ~/UbuntuSources | wc --lines
    Incidentally, the command name “wc” stands for “word count”—which does not, however, do the command justice, since it will count more than just “words.”
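
    A quick self-contained illustration of the line-counting behaviour:
```shell
# "wc --lines" counts the newline-terminated lines on its input.
printf 'first line\nsecond line\n' | wc --lines
```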

    To tell the “find” command that you are interested only in regular files, and that you do not want it to include, e.g., directory entries in its output, you should pass it a “-type” argument, with a parameter value of “f” (for “regular file”)—like so:
    Code:
    find ~/UbuntuSources -type f
    Following is a fragment of the output that may get produced by this command:
    Code:
    .
    .
    .
    /home/luvr/UbuntuSources/.temp/dists/karmic/Release.gpg
    /home/luvr/UbuntuSources/.temp/dists/karmic/universe/source/Sources.bz2
    /home/luvr/UbuntuSources/.temp/dists/karmic/universe/source/Sources.gz
    /home/luvr/UbuntuSources/.temp/dists/karmic/universe/source/Release
    /home/luvr/UbuntuSources/.temp/dists/karmic/universe/source/Sources
    /home/luvr/UbuntuSources/.temp/dists/karmic/multiverse/source/Sources.bz2
    /home/luvr/UbuntuSources/.temp/dists/karmic/multiverse/source/Sources.gz
    /home/luvr/UbuntuSources/.temp/dists/karmic/multiverse/source/Release
    /home/luvr/UbuntuSources/.temp/dists/karmic/multiverse/source/Sources
    /home/luvr/UbuntuSources/.temp/dists/karmic/restricted/source/Sources.bz2
    /home/luvr/UbuntuSources/.temp/dists/karmic/restricted/source/Sources.gz
    /home/luvr/UbuntuSources/.temp/dists/karmic/restricted/source/Release
    /home/luvr/UbuntuSources/.temp/dists/karmic/restricted/source/Sources
    /home/luvr/UbuntuSources/.temp/dists/karmic/Release
    /home/luvr/UbuntuSources/.temp/dists/karmic/main/source/Sources.bz2
    /home/luvr/UbuntuSources/.temp/dists/karmic/main/source/Sources.gz
    /home/luvr/UbuntuSources/.temp/dists/karmic/main/source/Release
    /home/luvr/UbuntuSources/.temp/dists/karmic/main/source/Sources
    .
    .
    .
    You will likely notice that, apparently, just a few unique file names keep showing up in the list—e.g., “Release,” “Sources,” etc. That leads to the next interesting question: exactly which unique file names appear in the output?

    Part 2: Finding the Unique File Names in the Directory Tree.

    So far, you haven’t given “find” any instructions on how to process the files that it found, so it simply displayed the path to each file—with a newline character appended, to produce a listing with one entry per line.

    You can customise the output format with the “-printf” argument, which takes a format string as its parameter. To output just the file name, with any leading directories removed, you can use the “%f” format specification, as follows:
    Code:
    find ~/UbuntuSources -type f -printf '%f'
    Note:

    Just as with regular expressions, it is generally a good idea to enclose a format string in single quotes, to prevent the shell from inappropriately processing many of the special characters that may appear in the format string.

    Even though the above command really does output just the file names, there is one little detail missing: the entries are not followed by newline characters, so the output appears as one long word on a single line. To ensure that each entry gets properly terminated with a newline character, you should add a “\n” escape sequence to the format string, i.e.:
    Code:
    find ~/UbuntuSources -type f -printf '%f\n'
    Once you can produce a listing of just the file names, you can reduce it to a list of unique names. To that end, you can pass it on to the “sort” command, with the “--unique” option to eliminate duplicates:
    Code:
    find ~/UbuntuSources -type f -printf '%f\n' | sort --unique
    The resulting output will look something like this:
    Code:
    Release
    Release.gpg
    Sources
    Sources.bz2
    Sources.gz
    wkstw1
    That last entry, “wkstw1,” is a file that the “debmirror” command created, with the file name set to the host name of the computer on which the command was run; on your computer, therefore, its name will surely be different. For all practical purposes, you can ignore this file, since it does not really have anything to do with a software archive per se.

    The other unique file names are:
    • “Release”—A high-level description of a software archive, including checksums of some of the other files in the archive.
    • “Release.gpg”—The digital signature for the “Release” file. If this file is missing, or if the signature cannot be verified or is incorrect, then the software archive cannot be trusted.
    • “Sources”—A detailed listing of all of the source packages in a software archive.
    • “Sources.gz”—A “gzip”-compressed copy of the “Sources” file.
    • “Sources.bz2”—A “bzip2”-compressed copy of the “Sources” file.

    By the way:

    The “debmirror” program will automatically download and verify digital signatures (i.e., “Release.gpg” files), unless you pass it the “--ignore-release-gpg” option.

    By the way:

    You can use the “gpgv” tool if you want to verify a digital signature. First, you will obviously have to locate the digital signature (i.e., the “Release.gpg” file) that you want to verify—e.g., using the “find” command:
    Code:
    find ~/UbuntuSources -name Release.gpg -type f
    You will get a list of files, similar to the following:
    Code:
    /home/luvr/UbuntuSources/.temp/dists/karmic-updates/Release.gpg
    /home/luvr/UbuntuSources/.temp/dists/karmic-security/Release.gpg
    /home/luvr/UbuntuSources/.temp/dists/karmic/Release.gpg
    To verify, for example, the last signature from this list, you can now run the “gpgv” utility, passing it the location of the signature file (i.e., “Release.gpg”) as its first argument, and the signed data file to which the signature corresponds (i.e., “Release”) as its second argument:
    Code:
    gpgv ~/UbuntuSources/.temp/dists/karmic/Release.gpg \
         ~/UbuntuSources/.temp/dists/karmic/Release
    If all goes well, the program will inform you that the signature is valid:
    Code:
    gpgv: Signature made Wed 28 Oct 2009 15:23:20 CET using DSA key ID 437D05B5
    gpgv: Good signature from "Ubuntu Archive Automatic Signing Key <ftpmaster@ubuntu.com>"
    Experiment 3: Concatenating the “Sources” Files.

    All of the source packages that belong to the software archive, are described in the “Sources” files. You can use the “find” command to list the “Sources” files, as follows:
    Code:
    find ~/UbuntuSources -name Sources -type f
    The resulting file list will look something like this:
    Code:
    /home/luvr/UbuntuSources/.temp/dists/karmic-updates/universe/source/Sources
    /home/luvr/UbuntuSources/.temp/dists/karmic-updates/multiverse/source/Sources
    /home/luvr/UbuntuSources/.temp/dists/karmic-updates/restricted/source/Sources
    /home/luvr/UbuntuSources/.temp/dists/karmic-updates/main/source/Sources
    /home/luvr/UbuntuSources/.temp/dists/karmic-security/universe/source/Sources
    /home/luvr/UbuntuSources/.temp/dists/karmic-security/multiverse/source/Sources
    /home/luvr/UbuntuSources/.temp/dists/karmic-security/restricted/source/Sources
    /home/luvr/UbuntuSources/.temp/dists/karmic-security/main/source/Sources
    /home/luvr/UbuntuSources/.temp/dists/karmic/universe/source/Sources
    /home/luvr/UbuntuSources/.temp/dists/karmic/multiverse/source/Sources
    /home/luvr/UbuntuSources/.temp/dists/karmic/restricted/source/Sources
    /home/luvr/UbuntuSources/.temp/dists/karmic/main/source/Sources
    Instead of processing each of these files separately, you may prefer to combine them into one stream, and continue to work with that—as documented next.

    The traditional command to concatenate files, and send them to “stdout,” is called “cat”; if you simply pass the “cat” command a list of files, then you will see their contents scroll by on-screen, as one long data stream.

    Thus, if you want to concatenate, e.g., the “Sources” files (as listed by the “find” command above), then you will have to find a way to generate a command line from their names, preceded by the “cat” command name—i.e.:
    Code:
    cat first-file second-file ... last-file
    Fortunately, there is a command available that is a perfect fit for this job: the “xargs” command. By default, the “xargs” command assumes that its standard input contains a sequence of items, delimited by blanks or newlines. It will construct a command line from:
    1. A command name (possibly followed by a set of initial arguments), passed to it via the command line;
    2. The sequence of items that it reads from standard input.

    Once it has built the command line in this way, it will subsequently execute it.
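
    A small self-contained illustration (“Items:” is just an arbitrary initial argument):
```shell
# xargs reads three items from stdin and appends them to the
# given command name ("echo") and initial argument ("Items:").
printf 'alpha\nbeta\ngamma\n' | xargs echo Items:
```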

    There is, however, one pitfall when you use “xargs” in a command pipeline to process the output from the “find” command: filenames may contain blanks (and even newlines), and these will not be parsed correctly. To overcome this issue, you should generally instruct the “find” command to terminate each filename with a null byte (instead of a newline)—i.e., you should pass it a “-print0” argument. You should then inform the “xargs” command that its input uses null bytes (instead of blanks and newlines) as item separators, by passing it a “--null” argument.
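
    You can see the problem—and the solution—with a throwaway directory containing a file whose name includes a space (the directory name “nulldemo” is arbitrary):
```shell
# Without -print0/--null, "name with spaces" would be split into
# three bogus items; with null-terminated names, it stays intact.
mkdir -p nulldemo
printf 'hello\n' > 'nulldemo/name with spaces'
find nulldemo -type f -print0 | xargs --null cat
```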

    You can now have the “find” command list the “Sources” filenames, separated by nulls, and pass the results on to the “xargs” command to concatenate the files and output them to the terminal window, as follows:
    Code:
    find ~/UbuntuSources -name Sources -type f -print0 | xargs --null cat
    Keep in mind that the “-print0” argument instructs “find” to separate the filenames with null bytes instead of newlines, while the “--null” argument informs “xargs” that its input contains null bytes as separators. The “cat” argument is the name of the command that “xargs” will specify on the command line that it generates; the remainder of the command line will consist of the filenames that were output by the “find” command.

    Note:

    Letting the output from the “cat” command scroll by on-screen is, obviously, not particularly useful; you may, therefore, wish to redirect it to a file. However, in the following experiments, I will show you how you can further expand the command pipeline, to eventually turn its output into a useful listing that can help you select exactly which files you want to download, based on, e.g., their sizes.

    Note:

    If you inadvertently omit the “--null” argument, while the input stream does use null bytes as separators, then the “xargs” command will emit the following warning message to its standard error stream:
    Code:
    xargs: Warning: a NUL character occurred in the input.  It cannot be passed through in the argument list.  Did you mean to use the --null option?
    You will likely miss this warning, though—unless you redirect the standard output stream.

    Note:

    The “xargs” command supports a “--verbose” argument—which will cause it to output the generated command line to its standard error stream, before it actually executes the command—e.g.:
    Code:
    find ~/UbuntuSources -name Sources -type f -print0 | xargs --null --verbose cat
    Again, you will likely miss the stderr output, unless you redirect the standard output stream.

    By the way:

    If (just for testing purposes) you are interested only in any messages that the “xargs” command may send to the stderr stream, and you don’t really care about the actual output, then you may redirect the stdout stream to “/dev/null”—i.e., the “data sink”:
    Code:
    find ~/UbuntuSources -name Sources -type f -print0 | xargs --null --verbose cat > /dev/null
    Any data that you send to the “data sink” will, in effect, get discarded.

    Experiment 4: Extracting Selected Lines from the “Sources” Files.

    All source packages are documented in the “Sources” files according to a template. To understand how this template works, you can take a closer look at one of the package definitions; consider, for example, the entry for the “zsh” package:
    Code:
    Package: zsh
    Binary: zsh, zsh-doc, zsh-static, zsh-dev, zsh-dbg
    Version: 4.3.10-5ubuntu1
    Priority: optional
    Section: shells
    Maintainer: Ubuntu Developers <ubuntu-devel-discuss@lists.ubuntu.com>
    Original-Maintainer: Clint Adams <schizo@debian.org>
    Build-Depends: texinfo, groff-base, libncursesw5-dev, texi2html (>= 1.76-3), libcap2-dev [!hurd-i386 !kfreebsd-i386 !kfreebsd-amd64], bsdmainutils, libpcre3-dev, texlive-latex-base | tetex-bin
    Architecture: any
    Standards-Version: 3.8.3
    Format: 1.0
    Directory: pool/main/z/zsh
    Files:
     64a2bf8be5b5c1d295c7d9e3fe921608 1388 zsh_4.3.10-5ubuntu1.dsc
     031efc8c8efb9778ffa8afbcd75f0152 3467439 zsh_4.3.10.orig.tar.gz
     cf95870f2f56709a7946a0d2506145dc 140918 zsh_4.3.10-5ubuntu1.diff.gz
    Homepage: http://www.zsh.org/
    Vcs-Browser: http://git.debian.org/?p=private/schizo/zsh.git;a=summary
    Vcs-Git: git://git.debian.org/git/private/schizo/zsh.git
    Checksums-Sha1:
     09772b8414046fc37576155e60d7bdbc348c442b 3467439 zsh_4.3.10.orig.tar.gz
     986fe381f06c9b580145398064a207c604267098 140918 zsh_4.3.10-5ubuntu1.diff.gz
    Checksums-Sha256:
     ace52518f217d0ed14a121763a550338c26e3c7b0e988d61aa67055a6231691c 3467439 zsh_4.3.10.orig.tar.gz
     292545aea5ed73647838c96c1bc288b9811f15c569ffac81595454631050b6f2 140918 zsh_4.3.10-5ubuntu1.diff.gz
    If you are interested only in the files that make up the source package, and in their location, then you need consider only the “Directory:” line and the file entries following the “Files:” heading—i.e.:
    Code:
    Directory: pool/main/z/zsh
     64a2bf8be5b5c1d295c7d9e3fe921608 1388 zsh_4.3.10-5ubuntu1.dsc
     031efc8c8efb9778ffa8afbcd75f0152 3467439 zsh_4.3.10.orig.tar.gz
     cf95870f2f56709a7946a0d2506145dc 140918 zsh_4.3.10-5ubuntu1.diff.gz
    These lines have the following contents:
    • The “Directory:” line shows the location of the files, relative to the root of the software archive.
      • For the online repository, the root of the software archive is defined by the “--method,” “--host” and “--root” parameter values that you specified on the “debmirror” command line—i.e.:
        Code:
        --method=http --host=archive.ubuntu.com --root=ubuntu
        Combining these values with the relative path, as found on the “Directory:” line, results in the following location:
        Code:
        http://archive.ubuntu.com/ubuntu/pool/main/z/zsh
        If you actually navigate to this location in a browser, you will see the expected files listed there (among others).
      • For your local mirror, the root of the archive is defined by the target directory that you specified on the “debmirror” command line—i.e.:
        Code:
        ~/UbuntuSources
        Combining this value with the relative path, gives the following location:
        Code:
        ~/UbuntuSources/pool/main/z/zsh
        However, since you performed only a dry run, you will not actually find the files there just yet.

    • Each file entry following the “Files:” heading, is formatted as follows:
      • It begins with a space character;
      • The first data item is the “MD5” checksum of the file, and consists of 32 hexadecimal digits;
      • This data item is followed by another space character;
      • The second data item is the size of the file, in bytes;
      • Next, there is another space character;
      • The third, and final, data item is the name of the file.

      This precise description of the format of these lines will prove to be an invaluable tool to select and process them later on.

    Part 1: Selecting the “Directory:” Lines from the Concatenated “Sources” Files.

    To select just the “Directory:” lines from the concatenation of the “Sources” files, and discard all of the other lines, you can expand the command pipeline with an invocation of the “grep” command—which you have already encountered. Just tell “grep” that you want to find all lines that begin with the word “Directory,” followed by a colon character (i.e., “:”) and a space—i.e.:
    Code:
    find ~/UbuntuSources -name Sources -type f -print0 | xargs --null cat \
       | grep '^Directory: '
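To see this filter in action on a small scale, you can feed it the sample “zsh” lines shown earlier (a toy demonstration only; the real pipeline reads the actual “Sources” files):

```shell
# Toy demonstration: run the same "grep" filter over the sample "zsh" lines.
printf 'Directory: pool/main/z/zsh\n 64a2bf8be5b5c1d295c7d9e3fe921608 1388 zsh_4.3.10-5ubuntu1.dsc\n' \
   | grep '^Directory: '
# → Directory: pool/main/z/zsh
```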
    Part 2: Selecting the File Entries Following the “Files:” Heading.

    It won’t come as a surprise that the “grep” tool is perfectly suited to select the file entries as well—i.e., those lines that begin with a space, followed by 32 hexadecimal digits, and another space.

    The regular expression to describe these lines consists of the following elements:
    • The “start-of-string” anchor, followed by a space (i.e., “^ ”)—to indicate that the line must begin with a space;
    • An expression that describes a sequence of “exactly 32 hexadecimal digits”—as documented below;
    • Another space—to ensure that the sequence of hexadecimal digits effectively ends after the 32nd digit; without this trailing space, the lines that contain a longer sequence of hexadecimal digits will be selected as well (i.e., the file entry lines following the “Checksums-Sha1:” or “Checksums-Sha256:” headings).

    To match a sequence of 32 hexadecimal digits, you will first have to define how to match a single hexadecimal digit, and subsequently specify that such a match must be repeated 32 times.

    Part 2-1: Matching a hexadecimal digit.

    A hexadecimal digit is either one of the decimal digits, “0” through “9,” or a letter in the range from either “A” through “F” or “a” through “f” (uppercase and lowercase letters are considered equivalent).

    The full list of hexadecimal digits, then, is:

    0123456789ABCDEFabcdef

    In a regular expression, you can generally match “one of the characters in a list” by enclosing the list in square brackets—i.e., “[” and “]”; therefore, the following notation will match one hexadecimal digit:
    Code:
    [0123456789ABCDEFabcdef]
    Within such a selection expression, you may use character ranges—which consist of the first character in the range, a hyphen (i.e., “-”), and the last character in the range. As an example, you may replace the list of decimal digits with the range “0-9”; similarly, you can replace the specified uppercase and lowercase letters with the corresponding ranges. Consequently, the above expression can be rewritten using ranges as follows:
    Code:
    [0-9A-Fa-f]
    As a further matter of convenience, the class of hexadecimal digits occurs frequently enough that a named notation was defined for it: “[:xdigit:]”—which has become the usual way to refer to “a hexadecimal digit” within selection expressions:
    Code:
    [[:xdigit:]]
    Note that this notation requires two pairs of square brackets; the outer pair encloses the selection expression, while the inner pair is an integral part of the character class name.
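You can quickly verify this notation on a couple of made-up strings (“deadBEEF” and “not-hex” are arbitrary examples):

```shell
# Only lines that consist entirely of hexadecimal digits survive this filter.
printf 'deadBEEF\nnot-hex\n' | grep '^[[:xdigit:]]*$'
# → deadBEEF
```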

    Part 2-2: Using “Interval Expressions” as Repetition Operators.

    Note:

    Before you can effectively use “Interval Expressions,” you will have to understand the distinction between “Basic Regular Expressions” and “Extended Regular Expressions.” Various features, including the interval expressions discussed here, are not generally available with Basic Regular Expressions, but require the use of Extended Regular Expressions instead.

    Just to understand how interval expressions work, however, you needn’t worry about Basic vs. Extended regular expressions. For now, just rest assured that “grep” really does support the feature—exactly how it implements it, will be explained in due course.

    “Extended Regular Expressions” support several repetition operators—the most general of which are “Interval Expressions.” Such expressions are enclosed in curly braces—i.e., “{” and “}”—and can take any of the following forms:
    • “{n}”—The preceding item is matched exactly “n” times;
    • “{n,}”—The preceding item is matched at least “n” times;
    • “{,m}”—The preceding item is matched at most “m” times;
    • “{n,m}”—The preceding item is matched at least “n,” but at most “m” times.
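For a quick, hands-on check of the “{n}” form, you can run “grep” with its “--extended-regexp” option (discussed later in this post) on a few made-up lines:

```shell
# Only the line with exactly three "a" characters matches "^a{3}$".
printf 'aa\naaa\naaaa\n' | grep --extended-regexp '^a{3}$'
# → aaa
```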

    Part 2-3: Matching a Sequence of Exactly 32 Hexadecimal Digits.

    As explained above, a single hexadecimal digit can be matched with the “[[:xdigit:]]” selection expression. Furthermore (assuming that you are using extended regular expressions), you can make any item match exactly 32 times by following it with the “{32}” interval expression. Therefore, a sequence of 32 hexadecimal digits can be matched as follows:
    Code:
    [[:xdigit:]]{32}
    You are not yet ready to use this feature with “grep,” though; first, you will have to learn about “Basic” vs. “Extended” Regular Expressions, and how “grep” implements them.

    Part 2-4: A Closer Look at “Extended” Regular Expressions.

    Extended Regular Expressions provide a full set of features—including all of the following repetition operators:
    • Interval Expressions, as described above, which are enclosed in curly braces—i.e., “{” and “}”;
    • The “+” sign, which is a shorthand for the “{1,}” interval expression—i.e., the preceding item is matched at least once;
    • The “*” sign, which is a shorthand for the “{0,}” interval expression—i.e., the preceding item is optional, and is matched any number of times;
    • The “?” sign, which is a shorthand for the “{,1}” interval expression—i.e., the preceding item is optional, and is matched at most once.

    In addition to these repetition operators, the following are also supported:
    • The “|” sign, which is the “alternation operator,” and matches either the item to its left, or the item to its right;
    • Left and right parentheses—i.e., “(” and “)”—which are used for grouping items together.
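The “?” operator and the alternation operator are easily demonstrated on a few made-up words (again using the “--extended-regexp” option, which is discussed later in this post):

```shell
# "u?" makes the "u" optional, so both spellings match:
printf 'color\ncolour\ncolouur\n' | grep --extended-regexp '^colou?r$'
# → color
# → colour

# "(cat|dog)" matches either word, but nothing else:
printf 'cat\ndog\nbird\n' | grep --extended-regexp '^(cat|dog)$'
# → cat
# → dog
```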

    If you want to remove the special meaning from any of the operators, and match it with a literal character instead, then you will have to escape it with a backslash (i.e., “\”)—so, for example:
    • “\{” matches a literal left curly brace—i.e., “{”—in the input string;
    • “\}” matches a literal right curly brace—i.e., “}”—in the input string;
    • “\+” matches a literal plus sign—i.e., “+”—in the input string;
    • etc.

    Part 2-5: A Closer Look at “Basic” Regular Expressions.

    Of the operators listed above, Basic Regular Expressions support only the “*”—which matches the preceding item zero or more times. As a consequence, if you want to match a literal asterisk, you will have to escape it with a backslash (i.e., “\*”).
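A small, made-up example shows the “*” operator at work in a basic regular expression:

```shell
# "a*" matches zero or more "a" characters, so the first three lines match.
printf 'b\nab\naab\nxb\n' | grep '^a*b$'
# → b
# → ab
# → aab
```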

    The other operators lose their special meanings, and will be processed as literal characters instead—just like letters, digits, and spaces. In other words, each of the following characters will simply match a single occurrence of itself:
    Code:
    { } + ? | ( )
    The meaning of an escape sequence that consists of a backslash (i.e., “\”) followed by one of these characters is left undefined. Any program that supports basic regular expressions should come with documentation that explains exactly how it processes such escape sequences.

    For example, some programs may simply ignore the backslash, and process the sequence as a literal match with the character following the backslash—so, e.g., “\?” will match a single question mark.

    Other programs may process both the backslash, and the character following it, as literal matches—so, e.g., “\?” will match a single backslash, followed by a question mark.

    In fact, any other interpretation is equally valid—as long as it is properly documented.

    Part 2-6: A Closer Look at “grep” Regular Expressions.

    The “grep” utility, by default, supports Basic Regular Expressions. However, “GNU grep” (which is distributed with Linux systems) makes particularly brilliant use of the escape sequences that are left undefined: any such escape sequence will activate the special meaning of the escaped character, as defined for Extended Regular Expressions.

    For example, consider the following regular expression:
    Code:
    [[:xdigit:]]{32}
    By default, “grep” will process this as a basic regular expression—it will, therefore, match one hexadecimal digit, followed by a left curly brace (i.e., “{”), the decimal digits “3” and “2,” and finally, a right curly brace (i.e., “}”).

    If you want to activate the special meanings of the curly braces, then (assuming that you are using “GNU grep”) you should escape them with a backslash—like so:
    Code:
    [[:xdigit:]]\{32\}
    GNU “grep” will interpret this as matching a sequence of 32 hexadecimal digits.
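You can observe both interpretations side by side on made-up input:

```shell
# Unescaped braces are literal in a basic regular expression, so this
# matches a hexadecimal digit followed by the literal text "{32}":
printf 'f{32}\n' | grep '[[:xdigit:]]{32}'
# → f{32}

# Escaped braces form an interval expression (GNU grep), so this
# matches a run of exactly 32 hexadecimal digits:
printf '0123456789abcdef0123456789abcdef\n' | grep '[[:xdigit:]]\{32\}'
# → 0123456789abcdef0123456789abcdef
```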

    Part 2-7: Finally! The Expanded Command Pipeline to Select the File Entries.

    Unless you are hopelessly confused by now, you are finally ready to build the pipeline to select the file entries from the concatenation of the “Sources” files.

    Remember that these lines are described by a regular expression that consists of the following elements:
    • The “start-of-string” anchor, followed by a space (i.e., “^ ”)—to indicate that the line must begin with a space;
    • An expression that describes a sequence of “exactly 32 hexadecimal digits”—as documented above;
    • Another space, to ensure that the sequence of hexadecimal digits effectively ends after the 32nd digit.

    The resulting command line, then, looks like this:
    Code:
    find ~/UbuntuSources -name Sources -type f -print0 | xargs --null cat \
       | grep '^ [[:xdigit:]]\{32\} '
    Part 2-8: Not Yet Confused? You Will Be After This Episode of... “grep”!

    As documented above, “grep” supports Basic Regular Expressions by default—even though “GNU grep” does provide a mechanism to use all of the features that are defined for Extended Regular Expressions.

    However, as an alternative, “grep” supports a command-line option to modify its behaviour and make it support Extended Regular Expressions instead: the “--extended-regexp” option.

    Therefore, the following command makes use of Extended Regular Expressions to select the detail data lines:
    Code:
    find ~/UbuntuSources -name Sources -type f -print0 | xargs --null cat \
       | grep --extended-regexp '^ [[:xdigit:]]{32} '
    Whether you prefer to use GNU “grep” with or without the “--extended-regexp” option, is entirely up to you, since the exact same features are supported both ways—but do make sure that you get the backslash escapes right!

    Part 3: Selecting Both “Directory:” Lines and File Entries from the “Sources” Files.

    To select both the “Directory:” lines and the file entries in one go, you will need the “alternation operator” (i.e., “|”)—which selects any input string that matches either the item to the left of the operator, or the item to its right.

    The following expression, for example, will match both types of lines that were discussed above:
    Code:
    '^Directory: |^ [[:xdigit:]]{32} '
    In effect, this expression identifies the lines that either:
    • Begin with the word “Directory” followed by a colon (i.e., “:”) and a space,

    or:
    • Begin with a space followed by a sequence of 32 hexadecimal digits and another space.

    Since the start-of-string anchor is common to both subexpressions, you may move it to the front, and use grouping (i.e., parentheses) to indicate where exactly the alternation should begin and end—as follows:
    Code:
    '^(Directory: | [[:xdigit:]]{32} )'
    This expression identifies the lines that begin with either:
    • The word “Directory” followed by a colon (i.e., “:”) and a space,

    or:
    • A space followed by a sequence of 32 hexadecimal digits and another space.

    Similarly, both subexpressions end with a space—which you may move to the end, after the right parenthesis.
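Applied to the sample “zsh” lines (plus a made-up third line for contrast), the grouped expression selects exactly the two lines of interest:

```shell
# The grouped expression keeps the "Directory:" line and the file entry,
# and drops the unrelated third line (a made-up example).
printf 'Directory: pool/main/z/zsh\n 64a2bf8be5b5c1d295c7d9e3fe921608 1388 zsh_4.3.10-5ubuntu1.dsc\nsome other line\n' \
   | grep --extended-regexp '^(Directory: | [[:xdigit:]]{32} )'
# → Directory: pool/main/z/zsh
# →  64a2bf8be5b5c1d295c7d9e3fe921608 1388 zsh_4.3.10-5ubuntu1.dsc
```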

    Keep in mind that both alternation and grouping are features of Extended Regular Expressions; consequently, if you want to use them with GNU “grep,” you will either have to use the “--extended-regexp” command-line option, or escape these operators with a backslash.

    Armed with this knowledge, you can now adapt the command pipeline as follows, to make it select both the “Directory:” lines and the file entries that follow the “Files:” heading:
    Code:
    find ~/UbuntuSources -name Sources -type f -print0 | xargs --null cat \
       | grep '^\(Directory:\| [[:xdigit:]]\{32\}\) '
    Note:

    The command pipeline, as shown here, does not make use of the “--extended-regexp” command-line option. If you wish, you can adapt it to run with the option instead—the results should be identical.

    Experiment 5: Reformatting the Extracted Lines to Make them Easier to Handle.

    While the lines that you have just extracted from the concatenated “Sources” files contain all the information that you need to identify all files that are present in the software archive (including their locations), they are, unfortunately, not formatted in a particularly handy way.

    Take, for example, another look at the lines that describe the source files of the “zsh” package:
    Code:
    Directory: pool/main/z/zsh
     64a2bf8be5b5c1d295c7d9e3fe921608 1388 zsh_4.3.10-5ubuntu1.dsc
     031efc8c8efb9778ffa8afbcd75f0152 3467439 zsh_4.3.10.orig.tar.gz
     cf95870f2f56709a7946a0d2506145dc 140918 zsh_4.3.10-5ubuntu1.diff.gz
    There are two main issues that make this list somewhat impractical to use:
    • To fully identify any of the files, you need to consider not just one, but two lines instead:
      • The actual file entry shows the name of the file, its size, and its checksum value;
      • The preceding “Directory:” heading shows the location of the file (relative to the root of the software archive).
    • The list would be easier on the human eye if it were in a fixed-width columnar format instead.

    Thus, the list would be far more manageable if you could reformat it akin to the following:
    Code:
               1388 64a2bf8be5b5c1d295c7d9e3fe921608 pool/main/z/zsh/zsh_4.3.10-5ubuntu1.dsc
            3467439 031efc8c8efb9778ffa8afbcd75f0152 pool/main/z/zsh/zsh_4.3.10.orig.tar.gz
             140918 cf95870f2f56709a7946a0d2506145dc pool/main/z/zsh/zsh_4.3.10-5ubuntu1.diff.gz
    These lines consist of the following elements:
    • A fixed-width, 15-position field for the file size;
    • A space character, to separate the first and second fields;
    • The checksum value—which is a 32-character field;
    • Another space character, to separate the second and third fields;
    • The path to the file, relative to the root of the software archive.
      Note that this field is composed of:
      • The directory in which the file is located—as extracted from the “Directory:” heading;
      • A forward slash (i.e., “/”), inserted between the directory and the file name;
      • The file name—as it occurs on the original file entry.

    As you may have come to expect by now, there exists a great, and surprisingly powerful, tool that allows you to transform the file list in exactly this way.

    That tool is called “awk”—and is introduced next.

    By the way:

    The program name “awk” is derived from the names of its original authors: Alfred V. (“Vaino”) Aho, Peter J. (“Jay”) Weinberger, and Brian W. (“Wilson”) Kernighan.

    Part 1: A First Look at “awk.”

    The primary goal of “awk” is to process text files, line by line, and to perform actions on them, according to a set of instructions that you supply.

    The instructions that “awk” understands, generally consist of two parts:
    1. An expression to select the lines of text to which the actions must be applied.
      This expression can take several forms—if, for instance, you want to select all lines that match a regular expression, then you should enclose that regular expression in forward slashes (i.e., “/”).
    2. A list of actions that must be performed on the selected lines.
      This action list should be enclosed in curly braces (i.e., “{” and “}”).


    To make “awk” take action on, for example, the “Directory:” lines (as discussed above), you can use the following notation:
    Code:
    /^Directory: /
    This selection expression should be followed with the list of actions that must be performed on the selected lines.

    One very simple action is just “print”—which prints the selected line, like so:
    Code:
    /^Directory: / { print }
    Part 1-1: Selecting and Printing the “Directory:” Lines.

    You can now expand the command pipeline, to make “awk” select and print the “Directory:” lines (and ignore the file entries) like this:
    Code:
    find ~/UbuntuSources -name Sources -type f -print0 | xargs --null cat \
       | grep '^\(Directory:\| [[:xdigit:]]\{32\}\) ' \
       | awk '/^Directory: / { print }'
    Note:

    Admittedly, the “grep” command in this pipeline is redundant, since you could pass the concatenated input files directly on to “awk” instead, and let that ignore all but the “Directory:” lines. Still, leaving the “grep” command in for now, will simplify the next few steps.

    Part 1-2: Printing a Field from a Selected Line.

    Clearly, if you simply wanted to print the selected lines, unchanged, then you wouldn’t need “awk”—since “grep” can take perfect care of this task.

    However, whenever “awk” reads an input line, it will scan it and extract its fields. By default, fields are delimited by white space (i.e., spaces and tabs); consequently, the “Directory:” lines will have two fields:
    1. The literal string “Directory:”;
    2. The (relative) path to the directory.

    To refer to a field, you use the field operator—which is a dollar sign (i.e., “$”)—followed by the position of the field by number; thus, “$1” identifies the first field, “$2” is the second field, and so on.
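A one-line demonstration, using the sample “Directory:” line from the “zsh” package:

```shell
# "$2" refers to the second whitespace-delimited field of the input line.
printf 'Directory: pool/main/z/zsh\n' | awk '{ print $2 }'
# → pool/main/z/zsh
```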

    To print just one field, instead of the whole line, you can specify the field on the “print” command. For example, the directory path is the second field on a “Directory:” line. Therefore, you can produce a listing of the directory paths that occur in the concatenated input files, as follows:
    Code:
    find ~/UbuntuSources -name Sources -type f -print0 | xargs --null cat \
       | grep '^\(Directory:\| [[:xdigit:]]\{32\}\) ' \
       | awk '/^Directory: / { print $2 }'
    Part 1-3: Saving a Field into a Variable.

    Instead of directly specifying the field on the “print” command, you can save its value into a variable, and subsequently use that—e.g., using “dirname” as the variable name:
    Code:
    find ~/UbuntuSources -name Sources -type f -print0 | xargs --null cat \
       | grep '^\(Directory:\| [[:xdigit:]]\{32\}\) ' \
       | awk '/^Directory: / { dirname = $2 ; print dirname }'
    Part 1-4: Combining and Printing Values from Multiple Lines.

    Next, you may expand the “awk” command with code to select and process the file entry lines as well. Thanks to the line filtering done by the “grep” command, you can keep the regular expression for the file entry lines pretty simple—just select any lines that begin with a space.

    The action could be as simple as printing the (relative) path to the file—i.e., the concatenation of:
    • The “dirname” value, saved from the preceding “Directory:” line;
    • A literal forward slash character;
    • The file name—i.e., the third field from the input line.

    It is, then, no longer necessary (or desirable) to print the “dirname” value while processing the “Directory:” line.

    The command line may, therefore, be updated as follows:
    Code:
    find ~/UbuntuSources -name Sources -type f -print0 | xargs --null cat \
       | grep '^\(Directory:\| [[:xdigit:]]\{32\}\) ' \
       | awk '/^Directory: / { dirname = $2 }   /^ / { print dirname "/" $3 }'
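Fed the two sample “zsh” lines, this two-rule “awk” program combines the saved directory with the file name, as expected:

```shell
# The first rule saves the directory; the second rule prints the full relative path.
printf 'Directory: pool/main/z/zsh\n 64a2bf8be5b5c1d295c7d9e3fe921608 1388 zsh_4.3.10-5ubuntu1.dsc\n' \
   | awk '/^Directory: / { dirname = $2 }   /^ / { print dirname "/" $3 }'
# → pool/main/z/zsh/zsh_4.3.10-5ubuntu1.dsc
```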
    Part 1-5: Getting the “awk” Instructions From a File.

    So far, you entered the “awk” instructions into a string, delimited by single quotes, directly on the command line. As the instruction list grows, however, this may easily become impractical.

    Therefore, it is far more common, and convenient, to save the instructions into a file, and supply that file—instead of the command line—as the source from which the instructions are to be read.

    Consider, for example, the following code:
    Code:
    /^Directory: /   { dirname = $2 }
    /^ /             { print dirname "/" $3 }
    If you save this code to a file named, e.g., “transform_source_index,” then you can specify this file name as the parameter value on a “-f” argument, in order to get “awk” to read its instructions from the file—i.e.:
    Code:
    find ~/UbuntuSources -name Sources -type f -print0 | xargs --null cat \
       | grep '^\(Directory:\| [[:xdigit:]]\{32\}\) ' \
       | awk -f transform_source_index
    Part 1-6: Making the “awk” Script File Executable.

    Once you save the “awk” instructions to a file, you can inform your system that the file contains executable code, and that “awk” is the program that should execute it.

    First, you will have to set the “executable” flag on the file, using the “chmod” command, like this:
    Code:
    chmod +x transform_source_index
    To verify that the file really is executable now, you can subsequently list its directory entry, in long format, as follows:
    Code:
    ls -l transform_source_index
    The first field of the output line should, then, show three “x” flags—e.g.:
    Code:
    -rwxr-xr-x
    Next, you will have to identify “awk” as the program to execute the file. To this end, you will have to add a so-called “shebang” line to the script.

    The “shebang” line must be the first line of the file, and must have the following format:
    • It must begin with a “hash” sign (i.e., “#”) and an “exclamation mark” (i.e., “!”);
    • The remainder of the line must specify the full path to the executing program—i.e., in this case, to “awk”—followed by a single command-line option flag, if required.

    Thus, to compose the “shebang” line, you will have to determine the full path to the “awk” tool—using, e.g., the “which” command:
    Code:
    which awk
    The result of this command should be similar to “/usr/bin/awk”; then, to complete the “shebang” line, you will have to append the “-f” option flag to it—in order to make “awk” read its instructions from the file.

    The updated version of the “transform_source_index” script, including the “shebang” line, will, then, look something like this:
    Code:
    #!/usr/bin/awk -f
    /^Directory: /   { dirname = $2 }
    /^ /             { print dirname "/" $3 }
    From now on, you can simply execute the “transform_source_index” file, just like any ordinary program. If, for example, the script file is in your current directory, then you can adapt the command pipeline as follows:
    Code:
    find ~/UbuntuSources -name Sources -type f -print0 | xargs --null cat \
       | grep '^\(Directory:\| [[:xdigit:]]\{32\}\) ' \
       | ./transform_source_index
    Thus, to execute the “transform_source_index” file, you no longer even have to be aware that it is an “awk” script!

    Note:

    The term “shebang” refers to the first two characters of the line: the “hash” and the “exclamation point”—which, in Unix jargon, is commonly called the “bang.”

    Part 2: Eliminating the “grep” Command from the Pipeline.

    The “awk” script, “transform_source_index,” is supposed to process two types of lines from its input:
    • The “Directory:” lines;
    • The file entry lines.

    So far, you have invoked the “grep” utility to select these lines, and to make sure that all other lines are removed from the input stream that gets passed to “awk.”

    You will not be able to eliminate “grep” from the command pipeline until you learn how to instruct “awk” to select the file entries—i.e., the lines that begin with a space, followed by exactly 32 hexadecimal digits, and another space. This operation is complicated by the observation that “awk” does not support interval expressions.

    Part 2-1: A Closer Look at “awk” Regular Expressions.

    Traditionally, “awk” claims to support “extended regular expressions”—but, equally traditionally, its view of what constitutes an “extended regular expression” does not include:
    • Character classes—such as “[:xdigit:]”—within selection expressions;
    • Interval expressions.

    Neither of these features is supported by “awk.”

    The lack of support for named character classes may be unfortunate, but is not particularly critical—you can simply replace the class name with the list of individual characters (or the ranges of characters) that belong to the class. As an example, instead of the “[:xdigit:]” class, you could use the “0-9A-Fa-f” construct.

    It is much harder, though, to come up with a good alternative for interval expressions; if you cannot use these, but you still want to express “exactly 32 occurrences” of an item, then you will have to explicitly code the item 32 times in a row—e.g.:
    Code:
    [0-9A-Fa-f][0-9A-Fa-f]...[0-9A-Fa-f][0-9A-Fa-f]
    (where the ellipsis—“...”—represents a sequence of yet another 28 “[0-9A-Fa-f]” selectors).

    It should be immediately obvious that a better alternative is sorely needed.

    Fortunately, “awk” provides a built-in function to test the length of a character string. You can, therefore, select and process the file entry lines as follows:
    • Select the lines that begin with a space, followed by one or more hexadecimal digits, and another space;
    • Actually process the line only if the length of the checksum string (i.e., the first field of the line) is 32.

    The new version of the “transform_source_index” script will, then, look something like this:
    Code:
    #!/usr/bin/awk -f
    /^Directory: /      { dirname = $2 }
    /^ [0-9A-Fa-f]+ /   { if ( length ( $1 ) == 32 ) print dirname "/" $3 }
    Important:

    In “awk” (just as in the C programming language), a single “=” sign represents the assignment operator, and a double “=” sign—i.e., “==”—is required for the equality testing operator.
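A tiny self-contained test makes the distinction concrete:

```shell
# "x = 32" assigns the value; "x == 32" tests for equality.
awk 'BEGIN { x = 32; if ( x == 32 ) print "equal" }'
# → equal
```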

    This version of the script will select only the desired input lines, and ignore the rest. Consequently, it allows you to remove “grep” from the command pipeline—like so:
    Code:
    find ~/UbuntuSources -name Sources -type f -print0 | xargs --null cat \
       | ./transform_source_index
    Part 2-2: (Optionally) Breaking with the Tradition: GNU “awk.”

    As documented above, “awk” traditionally lacks support for character classes and for interval expressions. GNU “awk” (a.k.a. “gawk”), however, does not strictly adhere to this tradition, and provides a set of options to control how it interprets regular expressions:
    • By default, GNU “awk” does support character classes, but not interval expressions.
      In addition, GNU “awk” defines some non-standard operators, which it also supports in its default mode of operation (but which will not be discussed here).
    • With the “--posix” command-line option, GNU “awk” supports the full set of POSIX standard features—including both character classes and interval expressions.
      Its non-standard operators, however, are not supported in this mode.
    • With the “--traditional” command-line option, GNU “awk” supports only traditional “awk” regular expressions.
      Consequently, character classes and interval expressions, as well as its non-standard operators, will be disabled.
    • With the “--re-interval” command-line option, GNU “awk” adds interval expressions to its set of supported features.
      In other words, all of its operators—including character classes, interval expressions, and the non-standard extensions, will be enabled.
    • If you specify both the “--traditional” and the “--re-interval” options, then GNU “awk” will support interval expressions in addition to the traditional “awk” operators.
      Both character classes and its non-standard operators, however, will be disabled.

    Part 2-3: Which “awk” Implementation Does Ubuntu Use By Default?

    To find out which “awk” versions are installed on your system, and which of the installed versions is currently in use, you can use the “update-alternatives” command, as follows:
    Code:
    update-alternatives --query awk
    Here’s an example of what this command may report:
    Code:
    Link: awk
    Status: auto
    Best: /usr/bin/mawk
    Value: /usr/bin/mawk
    
    Alternative: /usr/bin/mawk
    Priority: 5
    Slaves:
     awk.1.gz /usr/share/man/man1/mawk.1.gz
     nawk /usr/bin/mawk
     nawk.1.gz /usr/share/man/man1/mawk.1.gz
    From this output, we conclude that the system uses “mawk”—an independent, high-performance “awk” implementation by Mike Brennan, which faithfully implements only the traditional type of “awk” regular expressions.

    Note:

    To obtain authoritative confirmation about the features that your locally installed “awk” utility does or does not support, you should query its “man” page (i.e., its manual in electronic format), like so:
    Code:
    man awk
    This will bring up the local “awk” manual, and invoke the “less” pager program (which you encountered earlier on) to display it.

    You will be presented with, e.g., the “mawk” manual (if that’s the “awk” implementation that your system uses); the “Regular expressions” section of the manual explains which types of regular expressions are supported by the program.

    Part 2-4: Installing GNU “awk” on Your Ubuntu System.

    If you want to install the GNU “awk” implementation, “gawk,” on your system, then you can simply run the following command:
    Code:
    sudo apt-get install gawk
    If you subsequently query the list of “awk” implementations that are installed on your system, then the first few lines of output should look like this:
    Code:
    Link: awk
    Status: auto
    Best: /usr/bin/gawk
    Value: /usr/bin/gawk
    The “awk” man page will now document the features of the “gawk” implementation; its “Regular Expressions” section has the following to say about “interval expressions”:
    Code:
    r{n}
    r{n,}
    r{n,m}     One or two numbers inside braces denote an interval expression.
               If  there  is  one  number in the braces, the preceding regular
               expression r is repeated n times.  If  there  are  two  numbers
               separated  by a comma, r is repeated n to m times.  If there is
               one number followed by a comma, then r is repeated at  least  n
               times.
               Interval  expressions  are  only available if either --posix or
               --re-interval is specified on the command line.
    This confirms that your system is now using an “awk” implementation that does indeed support interval expressions—subject to the restrictions noted.

    Note:

    If your system continues to use the “mawk” implementation after you install “gawk” (or, alternatively, if you want to switch your system back to “mawk” instead), then you can invoke the “update-alternatives” command as follows:
    Code:
    sudo update-alternatives --config awk
    You will be presented with a list of “awk” alternatives that are available on your system. Either select one of the alternatives by number, or just hit the <ENTER> key to keep the currently active option.

    Part 2-5: Updating the “awk” Script to Use a Character Class.

    Assuming that your system is now configured to use “gawk” (instead of “mawk”), you can begin to use character classes without further ado. Thus, you can update the “transform_source_index” script to use the “[:xdigit:]” character class, like this:
    Code:
    #!/usr/bin/awk -f
    /^Directory: /       { dirname = $2 }
    /^ [[:xdigit:]]+ /   { if ( length ( $1 ) == 32 ) print dirname "/" $3 }
    No changes whatsoever are required to the command pipeline that runs the script:
    Code:
    find ~/UbuntuSources -name Sources -type f -print0 | xargs --null cat \
       | ./transform_source_index
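If you would like a quick sanity check of the pattern/action logic before pointing it at the real archive, you can feed the same two rules a fabricated two-line fragment of a "Sources" index (the directory name, checksum, and file name below are invented for illustration):

```shell
# Fabricated input: a "Directory:" line followed by one file entry line.
printf '%s\n' \
   'Directory: pool/main/d/demo' \
   ' d41d8cd98f00b204e9800998ecf8427e 2048 demo_1.0.orig.tar.gz' \
   | awk '/^Directory: /      { dirname = $2 }
          /^ [[:xdigit:]]+ /  { if ( length ( $1 ) == 32 ) print dirname "/" $3 }'
# Output:
# pool/main/d/demo/demo_1.0.orig.tar.gz
```

The first rule captures the directory name; the second recognises the file entry line and prints the reassembled relative path.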
    Part 2-6: Updating the “awk” Script to Use an Interval Expression.

    Even though “gawk” does provide support for interval expressions, this feature remains disabled by default. Indeed, the unchanged command pipeline will produce no output whatsoever if you simply update the “transform_source_index” script to use an interval expression for the file entry lines—like this:
    Code:
    #!/usr/bin/awk -f
    /^Directory: /          { dirname = $2 }
    /^ [[:xdigit:]]{32} /   { print dirname "/" $3 }
    To make this script work, you will have to force “gawk” to enable interval expressions, using either the “--posix” or the “--re-interval” command-line option.

    You may be tempted to add either of these options to the “shebang” line—e.g.:
    Code:
    #!/usr/bin/awk --posix -f
    Unfortunately, even though this looks like a great idea, it won’t work; if you make this modification to the “transform_source_index” script, and subsequently attempt to rerun the command pipeline unchanged, then “gawk” will display its usage information, to tell you that it couldn’t make sense of the arguments that you supplied:
    Code:
    Usage: awk [POSIX or GNU style options] -f progfile [--] file ...
    Usage: awk [POSIX or GNU style options] [--] 'program' file ...
    .
    .
    .
    Apparently, the system won’t properly parse the “shebang” line. This behaviour isn’t even really a bug, since it is documented in the “bash” man page; indeed, the “COMMAND EXECUTION” section of the manual has this to say on the subject:
    Code:
    If the program is a file beginning with #!, the remainder of the first line
    specifies an interpreter for the program.  The shell executes the specified
    interpreter on operating systems that do not handle this executable  format
    themselves.   The arguments to the interpreter consist of a single optional
    argument following the interpreter name on the first line of  the  program,
    followed  by the name of the program, followed by the command arguments, if
    any.
    If you have trouble understanding this (somewhat dense) paragraph, then here’s an explanation of what it seems to be trying to say:
    • You attempt to execute a program, say, by the name of “./transform_source_index”;
    • That program is a file that begins with “#!”;
    • Therefore, the remainder of its first line (i.e., “/usr/bin/awk --posix -f”) specifies an interpreter for the program file;
    • The operating system may execute the program on behalf of the shell, but if it doesn’t support this operation, then the shell will take over from here;
    • The name of the interpreter, as specified on the first line of the program (i.e., “/usr/bin/awk”), may be followed by a single argument (i.e., “--posix -f”);
    • The shell will execute the interpreter, “/usr/bin/awk,” with the following arguments:
      • The single argument that follows the name of the interpreter on the first line of the program—i.e., “--posix -f”;
      • The name of the program—i.e., “./transform_source_index”;
      • Any command-line arguments that you may have supplied on the command line—i.e., none, in this case.

    In other words, the “/usr/bin/awk” interpreter will be passed the string “--posix -f” as a single argument—which it doesn’t understand.
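You can watch the kernel do this for yourself with a tiny experiment: use "/bin/echo" as the "interpreter," so that it simply prints whatever arguments it receives. (The "shebang_demo" file name is, of course, made up for this demonstration.)

```shell
# Create a one-line script whose "interpreter" is /bin/echo.
printf '%s\n' '#!/bin/echo one two' > shebang_demo
chmod +x shebang_demo
# On Linux, "one two" is passed as a single argument, and the kernel appends
# the program name after it; echo then prints everything it was given:
./shebang_demo
# Output:
# one two ./shebang_demo
```

Note that "echo" joins its arguments with spaces when printing, so the visible output alone cannot show you the argument boundaries; the documented single-argument behaviour is what breaks "awk," which does try to parse its arguments.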

    Thus, if you want to pass the “--posix” option to the script, you should not add the option to the “shebang” line, but supply it on the command line instead. You will, therefore, have to modify your command pipeline as follows:
    Code:
    find ~/UbuntuSources -name Sources -type f -print0 | xargs --null cat \
       | ./transform_source_index --posix
    Note:

    If you find it highly unfortunate that you now have to supply the “--posix” command-line option whenever you run the “transform_source_index” script, then I must agree. After all, the “shebang” line provides a mechanism that should allow you to hide the specifics of how the script is to be executed—but now, there’s this silly little detail that keeps creeping up to bite you, if you don’t pay attention.

    Note:

    Many GNU utilities—including “gawk”—honour a “POSIXLY_CORRECT” environment variable, which causes them to behave strictly according to POSIX standards. You can run the following command to set the environment variable:
    Code:
    export POSIXLY_CORRECT=1
    If you subsequently rerun the command pipeline—even without the “--posix” option—in the current shell session, then the “gawk” program will behave in the same way as with the option.

    You should keep in mind, though, that the “POSIXLY_CORRECT” environment variable may subtly modify the behaviour of various programs in possibly unexpected ways.

    By the way:

    If you’re curious, the Linux kernel processes the “shebang” line in its “load_script()” function—the source code of which is located in the “linux/fs/binfmt_script.c” file.

    Part 3: Creating the Reformatted Output File.

    First, to recapitulate, your “transform_source_index” script currently looks something like this:
    Code:
    #!/usr/bin/awk -f
    /^Directory: /          { dirname = $2 }
    /^ [[:xdigit:]]{32} /   { print dirname "/" $3 }
    The command pipeline to execute the script will look like the following:
    Code:
    find ~/UbuntuSources -name Sources -type f -print0 | xargs --null cat \
       | ./transform_source_index --posix
    The result will be a list of files in the Ubuntu source package archive—where, for each file, the path relative to the root of the software archive will be shown.

    It is now time to enhance the script, to make it create output lines according to, e.g., the following format:
    • A fixed-width, 15-position field for the file size;
    • A space character, to separate the first and second fields;
    • The checksum value—which is a 32-character field;
    • Another space character, to separate the second and third fields;
    • The path to the file, relative to the root of the software archive.

    Part 3-1: Creating Formatted Output with the “printf” Statement.

    To create formatted output, “awk” supports the “printf” statement—which expects the following parameters:
    • A “format string” that specifies what the output should look like.
      The “format string” consists of literal text—which will be output unchanged—intermixed with “format specifiers”—which will be replaced with data values, according to a given set of rules.
    • A list of data values that will be substituted into the output, according to the “format specifiers” that occur in the “format string.”

    Additionally, keep in mind that the “printf” statement will not automatically append a newline character to the output string; to produce a newline, you will explicitly have to code a “\n” escape sequence.

    In the “format string,” the percent sign (i.e., “%”) will be used as an introducer to a “format specifier”; the percent sign should be followed by a data type specifier—e.g., “d” for a decimal number, or “s” for a character string.

    For example, to report the size of a file, as it appears on a file entry line as discussed above, you may code the following statement:
    Code:
    printf "File size: %d.\n" , $2
    This statement will output:
    • The literal string “File size: ”;
    • A decimal number—which takes the place of the “%d” format specifier;
    • A period (i.e., “.”);
    • A newline.

    The value of the decimal number that gets substituted into the output line will be taken from the first (and only) value supplied after the “format string”—i.e., the second field from the input line.

    If you want to output the number into a fixed-width field, then you should insert the desired field width in between the “%” sign and the “d” type specifier; for example, the following statement will output the file size, right-aligned, in a fixed-width, 15-position field:
    Code:
    printf "File size: %15d.\n" , $2
    Next, if you want to report not only the file size, but also its checksum value, then you could expand the “printf” statement like this:
    Code:
    printf "File size: %15d; checksum: %s.\n" , $2 , $1
    Note that this time, the “format string” includes two format specifiers—one for a decimal number, and one for a character string—and, consequently, two data values are required following the “format string.” The first value (i.e., “$2”—the second input field) will be treated as a decimal number, while the second value (i.e., “$1”—the first input field) will be interpreted as a character string.
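You can try this statement out directly on a fabricated file entry line (the checksum and file name below are invented for illustration):

```shell
# Feed one fabricated entry line to the printf statement discussed above.
echo " d41d8cd98f00b204e9800998ecf8427e 1024 demo_1.0.orig.tar.gz" \
   | awk '{ printf "File size: %15d; checksum: %s.\n" , $2 , $1 }'
# The size appears right-aligned in a 15-character field, followed by the
# checksum string and a period.
```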

    Part 3-2: Creating the Final Output Format.

    The final output format includes: the file size (in a fixed-width, 15-position field); the checksum value; and the path to the file (relative to the root of the software archive). You could, then, code the “printf” statement like this:
    Code:
    printf "%15d %s %s\n" , $2 , $1 , dirname "/" $3
    Note that the third data value is actually a concatenation of three items: the dirname value; a forward slash; and the third input field (i.e., the file name). You may, therefore, want to treat the dirname and the file name as two separate data items, and join them together with the forward slash in the format string—like so:
    Code:
    printf "%15d %s %s/%s\n" , $2 , $1 , dirname , $3
    Notice how the format string now includes four format specifiers, and that, as a consequence, the “printf” statement requires four data values for substitution into the output.

    With these modifications, here is what the “transform_source_index” script comes to look like:
    Code:
    #!/usr/bin/awk -f
    /^Directory: /          { dirname = $2 }
    /^ [[:xdigit:]]{32} /   { printf "%15d %s %s/%s\n" , $2 , $1 , dirname , $3 }
    Part 3-3: Creating the Master File List.

    Now that the file list is in the appropriate format, it is time for the final step in creating the “master file list”: sorting the file list by, e.g., descending file size.

    To ensure that the output will be sorted correctly, according to the numeric value of the file size, you should force a numeric sort on the first field. Also, since the same source file may be listed in more than one “Sources” index, you should remove duplicate lines. Do this in a separate, purely textual “sort --unique” step: if the “--unique” option were combined with the numeric sort, then any two different files that happen to have the same size would compare as equal, and all but one of them would be silently discarded.

    Furthermore, you should save the final output into a file—e.g., “master_filelist”—instead of letting it scroll by on-screen.

    Following, then, is the updated, and final, command pipeline:
    Code:
    find ~/UbuntuSources -name Sources -type f -print0 | xargs --null cat \
       | ./transform_source_index --posix \
       | sort --unique \
       | sort --key=1,1 --numeric-sort --reverse > master_filelist
    Note:

    If you’re curious about the biggest files in the software archive, then you can use the “head” command to view the first ten lines of the “master_filelist” as follows:
    Code:
    head master_filelist
    To modify the number of lines that you wish to see, you can use the “--lines” option—e.g., to view 20, instead of 10, lines:
    Code:
    head --lines 20 master_filelist
    Similarly, if you’re curious about the smallest files in the archive, then you can use the “tail” command:
    Code:
    tail master_filelist
    The “tail” command also supports the “--lines” option—e.g.:
    Code:
    tail --lines 20 master_filelist
    Experiment 6: Calculating the Total Download Size of the Software Archive.

    The “master_filelist” now lists the contents of the (online) software archive; each line identifies one data file, and includes the size of the file. Consequently, you can calculate the total download size of the software archive by simply adding together the file sizes found in the “master_filelist”—as illustrated by the following piece of “pseudo code”:
    Code:
    Initialise the “total_bytes” variable to zero.
    For each input line from the “master_filelist”:
       Add the “file size” field to the “total_bytes” variable.
    End For.
    The total download size, in bytes, is now present in the “total_bytes” variable.
    You can divide it by 1024*1024 if you want to report it in MiBs instead of bytes.
    The translation of this approach to “awk” is fairly straightforward:
    • You do not explicitly have to initialise the “total_bytes” variable to zero; “awk” will automatically do that for you.
      If you still want to perform the initialisation yourself anyway, then you will have to write an action list that “awk” should run just once, at the very beginning of the run (even before it reads the first line from its input file). To this end, “awk” supports a special selection expression that consists simply of the word “BEGIN”—like so:
      Code:
      BEGIN { total_bytes = 0 }
    • The main action—i.e., adding the “file size” field to the “total_bytes” variable—must be executed for every input line. In other words, you will have to write an action list that “awk” will run for every input line. To this end, you can simply omit the selection expression. Therefore, since the “file size” is the first field of the input line, you end up with something like the following:
      Code:
      { total_bytes = total_bytes + $1 }
      In fact, this operation is commonly abbreviated as follows:
      Code:
      { total_bytes += $1 }
      If this expression confuses you, then you may want to read it as: “add the first input field to the total_bytes variable”; hopefully, this wording helps clear up any confusion that may have arisen.

    • Finally, after all input lines are processed, you will want to print the calculated value. In other words, you want to create an action list that “awk” should run just once, at the very end of the run (even after all lines from the input file are processed). To this end, “awk” supports a special selection expression that consists simply of the word “END”—like so:
      Code:
      END { print total_bytes / (1024 * 1024) " MiB." }

    Putting all these elements together (and omitting the explicit initialisation of the “total_bytes” variable), you arrive at a command line similar to the following:
    Code:
    awk '{ total_bytes += $1 }   END { print total_bytes / (1024 * 1024) " MiB." }' < master_filelist
    If you run this command line, then you will see the following output:
    Code:
    27502.5 MiB.
    This is the total download size of all data files that are present in the software archive.
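If you want to convince yourself that the accumulation works, you can run the same one-liner on a handful of fabricated entries whose sizes add up to a known total (1 MiB + 2 MiB + 512 KiB = 3.5 MiB; the checksums and paths are placeholders):

```shell
# Three fabricated "size checksum path" lines, summed by the same awk program.
printf '%s\n' \
   '1048576 aaaa pool/a' \
   '2097152 bbbb pool/b' \
   '524288 cccc pool/c' \
   | awk '{ total_bytes += $1 }   END { print total_bytes / (1024 * 1024) " MiB." }'
# Output:
# 3.5 MiB.
```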
    Note:

    You may remember the download size reported, earlier on, by the “debmirror” dry run:
    Code:
    Download all files that we need to get (28061 MiB).
    This size includes not only the data files (which are listed in your “master_filelist”), but also the control files (such as “Release,” “Sources,” etc.).

    Experiment 7: Selecting a Subset of Files for an Initial Download.

    Depending on your internet connectivity—and on your patience—you may decide that downloading some 27 GiBs of data in one go is a bit too much of a good thing. You may, therefore, want to select a subset of the files for an initial download. In this experiment, you will study various strategies to build such a subset, and you will calculate the total download size of each subset, to help you determine which one suits you best.

    Part 1: Subsetting the File List Based on a Regular Expression.

    To select a subset of the files listed in the “master_filelist,” you may want to use the path and file name as the criterion. For example, if you want to select all input lines that contain the string “language-pack,” then you can code the following selection expression in “awk”:
    Code:
    /language-pack/
    For each selected line, you may want to take the following actions:
    • Print the path and file name—i.e., the third input field;
    • Add the file size—i.e., the first input field—to the total download size of the selected files.

    Finally, at the very end of the run, you can print the calculated download size.

    The “awk” script to take these actions will look something like this:
    Code:
    #!/usr/bin/awk -f
    /language-pack/   { print $3 ; download_size += $1 }
    END               { print download_size / (1024 * 1024) " MiB." }
    If you save this code to a file named, e.g., “subset_filelist,” and make this file executable, then you can subsequently run it as follows:
    Code:
    ./subset_filelist < master_filelist
    You will see the list of selected files, followed by their total download size:
    Code:
    pool/main/l/language-pack-gnome-fr-base/language-pack-gnome-fr-base_9.10+20091022.tar.gz
    pool/main/l/language-pack-kde-pt-base/language-pack-kde-pt-base_9.10+20091022.tar.gz
    pool/main/l/language-pack-gnome-pt-base/language-pack-gnome-pt-base_9.10+20091022.tar.gz
    .
    .
    .
    pool/main/l/language-pack-gnome-bo-base/language-pack-gnome-bo-base_9.10+20091022.tar.gz
    pool/main/l/language-pack-gnome-lg-base/language-pack-gnome-lg-base_9.04+20090413.tar.gz
    pool/universe/s/sword-language-packs/sword-language-packs_0.3ubuntu1.tar.gz
    603.068 MiB.
    As an added convenience, you may want the script to inform you not only about the total download size, but also about the number of files selected. To that end, you may keep track of a counter variable that you increment for each selected file. The shorthand notation to “add 1 to a variable” uses the “++” operator—which results in the following updated “awk” script:
    Code:
    #!/usr/bin/awk -f
    /language-pack/   { print $3 ; download_size += $1 ; number_of_files ++ }
    END               { print download_size / (1024 * 1024) " MiB in " number_of_files " files." }
    If you execute this version of the script, then the last line of output will look like this:
    Code:
    603.068 MiB in 398 files.
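You can also exercise this selection on a fabricated two-line list (the sizes, checksums, and package names below are invented); only the “language-pack” line should be selected and counted:

```shell
# One matching and one non-matching fabricated entry.
printf '%s\n' \
   '1048576 d41d8cd98f00b204e9800998ecf8427e pool/main/l/language-pack-fr/language-pack-fr_1.tar.gz' \
   '2048 9e107d9d372bb6826bd81d3542a419d6 pool/main/z/zlib/zlib_1.tar.gz' \
   | awk '/language-pack/   { print $3 ; download_size += $1 ; number_of_files ++ }
          END               { print download_size / (1024 * 1024) " MiB in " number_of_files " files." }'
# Output:
# pool/main/l/language-pack-fr/language-pack-fr_1.tar.gz
# 1 MiB in 1 files.
```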
    There’s one little detail about this script that might be improved: If you want to redirect the list of selected files to an output file, then the last line will get written to that file as well. You will likely prefer this line to go elsewhere, though.

    By default, “awk” will send its output to the “Standard Output Stream.” If you want any output to be sent anywhere else, then you will have to use the “>” “output redirection” operator—which must be followed by the name of the output file. Furthermore, in addition to actual file names, “awk” recognises a number of special data streams—most notably, “/dev/stderr” to refer to the “Standard Error Stream.”

    Thus, you may send the final output line to the “Standard Error Stream” as follows:
    Code:
    #!/usr/bin/awk -f
    /language-pack/   { print $3 ; download_size += $1 ; number_of_files ++ }
    END               { print download_size / (1024 * 1024) " MiB in " number_of_files " files." > "/dev/stderr" }
    If you subsequently run this script, then you may use output redirection to send the list of selected files to an output file—e.g.:
    Code:
    ./subset_filelist < master_filelist > download_filelist
    The “download_filelist” will now list the selected files, while the final output line will continue to appear on-screen instead.

    Note:

    Keep in mind that the “awk” output redirection operator is different from output redirection in the shell. The first time that “awk” processes its “>” operator to send output to a given file, it will automatically open the file for you—if the file does not yet exist, then it will be created, but if it does exist, then its contents will be deleted. Any subsequent output operations on the file will simply append data to the file.

    Also keep in mind that “gawk” interprets the “/dev/stderr” name internally; the redirection will therefore work even on systems that do not provide an actual “/dev/stderr” device file.
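Here is a small check of the truncate-once-then-append behaviour described above: the output file starts out with old contents, a single “awk” run prints three lines to it, and only the first print truncates the file. (The “awk_redirect_demo.txt” file name is arbitrary.)

```shell
# Pre-existing contents that the first awk print should wipe out.
printf 'old contents\n' > awk_redirect_demo.txt
# Three separate print statements, all redirected to the same file name.
seq 3 | awk '{ print $1 > "awk_redirect_demo.txt" }'
cat awk_redirect_demo.txt
# Output:
# 1
# 2
# 3
```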

    Part 2: Subsetting the File List by Line Number.

    If you open the “master_filelist” in a text editor, then you may decide that you want to select a block of lines by line number. To this end, “awk” supports a special form of selection expression, like this:
    Code:
    NR==fromline,NR==toline
    In this expression, “fromline” represents the line number of the first line that you want to select from the input file, and “toline” identifies the last line to be selected.
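A tiny demonstration of this range selection, on five generated input lines:

```shell
# Select input lines 2 through 4, by line number.
seq 5 | awk 'NR==2,NR==4'
# Output:
# 2
# 3
# 4
```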

    Important:

    Remember that the comparison operator for equality requires a double “equals” sign—i.e., “==”—in “awk,” and that a single “equals” sign is reserved for assignment instead. It is an incredibly common mistake, even among fairly experienced “awk” users, to overlook this critical distinction. (In fact, I got bitten by it while I was preparing this document...)

    So, to select, say, line numbers 3330 through 3399 from the “master_filelist,” you could modify the “subset_filelist” script as follows:
    Code:
    #!/usr/bin/awk -f
    NR==3330,NR==3399   { print $3 ; download_size += $1 ; number_of_files ++ }
    END                 { print download_size / (1024 * 1024) " MiB in " number_of_files " files." > "/dev/stderr" }
    If you execute this script, then the final output line will look like the following:
    Code:
    65.7176 MiB in 70 files.
    Part 3: Subsetting the File List by File Size.

    You may want to select files based on their sizes—e.g., all files that are greater than 100 KiB, but smaller than 200 KiB. Obviously, since the file list is sorted by (descending) file size, you could open the “master_filelist” in an editor, and look for the block of files that satisfy this condition; once you had found them, you could then select them by line number, as documented above.

    Alternatively, you could simply test the file size (i.e., the first input field), and select any files for which this value is greater than 100 KiB and smaller than 200 KiB:
    Code:
    #!/usr/bin/awk -f
    ($1 > 100 * 1024) && ($1 < 200 * 1024)   { print $3 ; download_size += $1 ; number_of_files ++ }
    END                                      { print download_size / (1024 * 1024) " MiB in " number_of_files " files." > "/dev/stderr" }
    Important:

    The logical “AND” operator requires a double “ampersand”—i.e., “&&”—in “awk” (as it does in the C programming language).
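The size test can be tried out in isolation, on a handful of fabricated sizes; only the values strictly between 2 KiB and 5 KiB are selected:

```shell
# Each input line carries one size; the pattern alone selects matching lines.
printf '%s\n' 1024 3072 4096 8192 | awk '($1 > 2 * 1024) && ($1 < 5 * 1024)'
# Output:
# 3072
# 4096
```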

    Part 4: Subsetting the File List by Total Download Size.

    Of course, if you have a specific limit in mind for the total download size, then you may simply want to select any file that keeps the total size within your set limit.

    In other words, for every file:
    • If the current total download size, augmented with the current file size, remains within the limit, then select the file;
    • Otherwise, reject the file, since it would cause the total download size to exceed the limit.

    Assume, for instance, that you want to download up to 2.5 GiB of data—i.e., 2.5*1024*1024*1024 bytes, or, if you prefer to perform only integer arithmetic, 5*512*1024*1024 bytes.

    You could, then, update your “awk” script as follows:
    Code:
    #!/usr/bin/awk -f
    download_size + $1 <= 5 * 512 * 1024 * 1024   { print $3 ; download_size += $1 ; number_of_files ++ }
    END                                           { print download_size / (1024 * 1024) " MiB in " number_of_files " files." > "/dev/stderr" }
    The output from this script will look something like the following:
    Code:
    pool/universe/n/nexuiz-data/nexuiz-data_2.5.2.orig.tar.gz
    pool/universe/i/ia32-libs/ia32-libs_2.7ubuntu17.tar.gz
    pool/main/o/openoffice.org/openoffice.org_3.1.1.orig.tar.gz
    pool/multiverse/s/sauerbraten-data/sauerbraten-data_0.0.20090504.orig.tar.gz
    pool/universe/o/openarena-data/openarena-data_0.8.1.orig.tar.gz
    pool/universe/o/opencv/opencv_1.0.0.orig.tar.gz
    pool/universe/libe/libemail-localdelivery-perl/libemail-localdelivery-perl_0.217.orig.tar.gz
    2560 MiB in 7 files.
    Experiment 8: Creating the Download List, Checksum List, and Remaining File List.

    Now that you can select a subset of files that you want to download, you are ready to actually perform the download. However, before you do so, there are a few additional features that you may want to take into account:
    • In addition to the actual download list, you may want to create a checksum list as well—i.e., a file that lists your selected files with their checksum values, to allow you to verify that all files are downloaded without errors.
    • You may also want to save the list of files that you do not select; then, after you download the selected files, you can use this list of remaining files to select the next subset.

    In other words, you may want to update your “awk” script to create three, instead of just one, output files:
    • “download_filelist”—which contains the list of selected files, as described above.
    • “checksum_filelist”—which contains the list of selected files, in a format that will allow you to verify the integrity of the files after you download them. Each line of this file must consist of the following items:
      • The checksum value—which is a 32-character string;
      • A blank space;
      • An asterisk—i.e., “*”;
      • The path to the file, relative to the current directory.
    • “remaining_filelist”—which contains a copy of all input lines that you did not select.

    Note that, since you will now have to take action on every input line, you will want to omit the selection expression, and let the action list decide which route to take.

    Assuming that you want to continue to select files based on a total download size of up to 2.5 GiB, a skeletal version of your new “subset_filelist” script will, then, look something like this:
    Code:
    #!/usr/bin/awk -f
    { if (download_size + $1 <= 5 * 512 * 1024 * 1024)
      {
         # ...code to process selected files should go here...
      }
      else
      {   
         # ...code to process remaining files should go here...
      }
    }
    END   { print download_size / (1024 * 1024) " MiB in " number_of_files " files." > "/dev/stderr" }
    The code to process remaining files is pretty simple: it should just copy the input line, unchanged, to a file named “remaining_filelist.” Since “awk” provides a copy of the entire input line in its special “$0” variable (that’s “dollar zero,” not “dollar uppercase oh,” by the way), that’s a pretty easy action to take:
    Code:
    print $0 > "remaining_filelist"
    The code to output the name of a selected file to the “download_filelist” is equally simple:
    Code:
    print $3 > "download_filelist"
    Finally, to output a line to the “checksum_filelist,” you will have to identify the path to the data file relative to the current directory—in other words, you will have to include the “UbuntuSources” directory level on the path (unless you prefer to enter the “UbuntuSources” directory whenever you want to verify the checksums).

    The complete output line will, therefore, contain the checksum value, a space, an asterisk, the literal “UbuntuSources/” string, and finally, the file path taken from the input line:
    Code:
    print $2 " *UbuntuSources/" $3 > "checksum_filelist"
    Consequently, the final version of the “subset_filelist” script will look like this:
    Code:
    #!/usr/bin/awk -f
    { if (download_size + $1 <= 5 * 512 * 1024 * 1024)
      {
         print $3 > "download_filelist" ;
         print $2 " *UbuntuSources/" $3 > "checksum_filelist" ;
         download_size += $1 ;
         number_of_files ++
      }
      else
      {   
         print $0 > "remaining_filelist"
      }
    }
    END   { print download_size / (1024 * 1024) " MiB in " number_of_files " files." > "/dev/stderr" }
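Here is a dry run of the selection logic on a fabricated three-line list, with the limit lowered to 3000 bytes so that the split becomes visible. Note that the second file is rejected, but the third, smaller file still fits within the limit:

```shell
# Three fabricated entries: 2048, 1024, and 512 bytes.
printf '%s\n' \
   '2048 d41d8cd98f00b204e9800998ecf8427e pool/main/a/aaa/aaa_1.0.tar.gz' \
   '1024 9e107d9d372bb6826bd81d3542a419d6 pool/main/b/bbb/bbb_1.0.tar.gz' \
   '512 e4d909c290d0fb1ca068ffaddf22cbd0 pool/main/c/ccc/ccc_1.0.tar.gz' \
   | awk '{ if (download_size + $1 <= 3000)
            {
               print $3 > "download_filelist" ;
               print $2 " *UbuntuSources/" $3 > "checksum_filelist" ;
               download_size += $1
            }
            else
            {
               print $0 > "remaining_filelist"
            }
          }'
cat download_filelist
# Output:
# pool/main/a/aaa/aaa_1.0.tar.gz
# pool/main/c/ccc/ccc_1.0.tar.gz
```

The rejected 1024-byte entry ends up, unchanged, in the “remaining_filelist.”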
    Experiment 9: Downloading the Selected Files.

    The list of files that you selected for downloading is now available in the “download_filelist.” To actually download them, you will use the “wget” utility—which is described in its “man” page as “the non-interactive network downloader.”

    In general terms, you pass the “wget” utility a list of files (such as your “download_filelist”), and supply it with a set of options to adapt its behaviour to your expectations.

    Obviously, you will have to tell the “wget” program where it should look for the list of files that you want it to download; you can do this with its “--input-file” option—like so:
    Code:
    --input-file=download_filelist
    Since the input file contains relative paths, you will also have to specify the base URL from which the files must be downloaded—i.e., the URL that must be prepended to the relative paths:
    Code:
    --base=http://archive.ubuntu.com/ubuntu/
    By default, “wget” will download all files into the current directory; if you want to download to a different location, then you will have to specify the target location with the “--directory-prefix” option—e.g.:
    Code:
    --directory-prefix=UbuntuSources
    Also, by default, “wget” will download all files directly to the target location, and it will not create a directory hierarchy. Assume, for instance, that the input file contains the following line:
    Code:
    pool/universe/i/ia32-libs/ia32-libs_2.7ubuntu17.tar.gz
    The “ia32-libs_2.7ubuntu17.tar.gz” file will then be downloaded directly into the target location (e.g., “UbuntuSources”)—not into its “pool/universe/i/ia32-libs” subdirectory.

    If you do want “wget” to create subdirectories at the target location, then you will have to specify the “--force-directories” option:
    Code:
    --force-directories
    However, the directory hierarchy created by “wget” will, then, start at the host name from which you download; in other words, the program will download all files into an “archive.ubuntu.com” subdirectory tree; if you do not want this host name directory, then you should use the “--no-host-directories” option:
    Code:
    --no-host-directories
    This time, however, the directory hierarchy will start at the first level following the host name—i.e., at the “ubuntu” directory. If you want to skip this level as well, and you want the directory hierarchy to start one level deeper (i.e., at the “pool” level), then you will have to use the “--cut-dirs” option, with the number of levels that you want to skip—i.e., “1”—as its argument value:
    Code:
    --cut-dirs=1
    One further option that you may find useful is “--no-clobber”:
    Code:
    --no-clobber
    With this option, “wget” will refuse to download a file that already exists at the target location; thus, if you accidentally pass “wget” the same set of file names twice, then the second run will not redownload all the files that you have already gotten.

    Caveat:

    If any file gets downloaded only partially, or incorrectly, but you still want to specify the “--no-clobber” option when you subsequently retry the download, then you will first have to manually remove the file. Otherwise, the “--no-clobber” option will prevent it from being redownloaded.

    With this discussion out of the way, here’s what the resulting “wget” command line comes to look like:
    Code:
    wget --input-file=download_filelist   \
       --base=http://archive.ubuntu.com/ubuntu/   \
       --directory-prefix=UbuntuSources   \
       --force-directories --no-host-directories --cut-dirs=1   \
       --no-clobber
    Note that “wget” will display a nice progress indicator, to show you how the download is proceeding.

    Experiment 10: Verifying the Checksums of the Downloaded Files.

    Once the download completes, you can use the “md5sum” utility to verify that the checksums of the downloaded files match. Just pass the “checksum_filelist” to the “-c” option of the “md5sum” command—as follows:
    Code:
    md5sum -c checksum_filelist
    If all files were downloaded without errors, then the output from the “md5sum” utility will look like this:
    Code:
    UbuntuSources/pool/universe/n/nexuiz-data/nexuiz-data_2.5.2.orig.tar.gz: OK
    UbuntuSources/pool/universe/i/ia32-libs/ia32-libs_2.7ubuntu17.tar.gz: OK
    UbuntuSources/pool/main/o/openoffice.org/openoffice.org_3.1.1.orig.tar.gz: OK
    UbuntuSources/pool/multiverse/s/sauerbraten-data/sauerbraten-data_0.0.20090504.orig.tar.gz: OK
    UbuntuSources/pool/universe/o/openarena-data/openarena-data_0.8.1.orig.tar.gz: OK
    UbuntuSources/pool/universe/o/opencv/opencv_1.0.0.orig.tar.gz: OK
    UbuntuSources/pool/universe/libe/libemail-localdelivery-perl/libemail-localdelivery-perl_0.217.orig.tar.gz: OK
    If the checksum for any file does not match, then the file must have been downloaded with errors; “md5sum” will report this condition as follows:
    Code:
    UbuntuSources/pool/universe/n/nexuiz-data/nexuiz-data_2.5.2.orig.tar.gz: FAILED
    To correct this error, you will have to redownload the file; do remember the caveat about the “--no-clobber” option, above.
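If more than a handful of files failed, picking them out by hand gets tedious. Here is a minimal sketch that pulls the FAILED entries out of the “md5sum” report and deletes them, so that a subsequent “--no-clobber” run will fetch fresh copies; the demo directory, file names, and report file are all fabricated stand-ins for your real downloads:

```shell
# Demo setup (fabricated files standing in for real downloads):
mkdir -p demo_pool
printf 'intact\n' > demo_pool/good.tar.gz
printf 'intact\n' > demo_pool/bad.tar.gz
md5sum demo_pool/good.tar.gz demo_pool/bad.tar.gz > demo_checksum_filelist
printf 'corrupt\n' > demo_pool/bad.tar.gz    # simulate a botched download

# Save the verification report ("report.txt" is a made-up name);
# md5sum exits non-zero when anything fails, hence the "|| true".
md5sum -c demo_checksum_filelist 2>/dev/null > report.txt || true

# Every FAILED line names a file to remove before retrying wget.
awk -F': ' '/FAILED/ { print $1 }' report.txt \
  | while IFS= read -r file; do rm -f "$file"; done
```

After this cleanup, rerunning the “wget” command shown earlier will fetch only the removed files, since “--no-clobber” skips everything that is still in place.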

    Finally, if any file simply could not be downloaded at all, then “md5sum” will output the following error messages:
    Code:
    md5sum: UbuntuSources/pool/universe/i/ia32-libs/ia32-libs_2.7ubuntu17.tar.gz: No such file or directory
    UbuntuSources/pool/universe/i/ia32-libs/ia32-libs_2.7ubuntu17.tar.gz: FAILED open or read
    If “md5sum” encounters any errors, then it will terminate with a summary like the following:
    Code:
    md5sum: WARNING: 1 of 7 listed files could not be read
    md5sum: WARNING: 1 of 6 computed checksums did NOT match

    Epilogue: What’s Next?

    You have now successfully downloaded an initial subset of files from the online software archive.

    To continue, you may want to select the next subset from the list of files that you saved to the “remaining_filelist.” You could, for example, replace your original “master_filelist” with this shorter list, as follows:
    Code:
    mv remaining_filelist master_filelist
    This command moves (or renames) the source file, “remaining_filelist,” to the target, “master_filelist”; if the target file already exists (as is the case in this instance), then it will simply be overwritten without warning.
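If you would rather not lose the old list silently, GNU “mv” can keep a backup of the overwritten target. A small sketch, with fabricated file contents standing in for the real lists:

```shell
# Fabricated stand-ins for the real lists:
printf 'old selection\n'  > master_filelist
printf 'remaining files\n' > remaining_filelist

# --backup=numbered renames the existing master_filelist to
# master_filelist.~1~ before the move, instead of overwriting it.
mv --backup=numbered remaining_filelist master_filelist
```

The old list then survives as “master_filelist.~1~”, which can be handy if you ever need to retrace which subsets you have already processed.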

    You can subsequently return to Experiment 6, 7, or 8 to select the next batch of files to download.

    Alternatively, you can now run the “debmirror” utility for real, and let it download all remaining files from the repository—e.g.:
    Code:
    debmirror --progress \
       --method=http --host=archive.ubuntu.com --root=ubuntu \
       --dist=karmic,karmic-security,karmic-updates \
       --section=main,multiverse,restricted,universe \
       --arch=none \
       ~/UbuntuSources
    In fact, after you finish downloading all the subsets that you wanted to select, you will still have to run this “debmirror” command once, if only to get the control files in place.
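Incidentally, the same command with “--dry-run” added will rehearse the download without fetching anything; if you save its output to a log file, the total-size line mentioned at the start of this post no longer scrolls past unnoticed. A sketch (the log file name is made up, and a stand-in log line is fabricated here so the extraction can be demonstrated offline):

```shell
# In a real run you would capture the rehearsal output like this:
#   debmirror --dry-run --progress ... ~/UbuntuSources 2>&1 | tee dryrun.log
# Stand-in log line, so the grep below can be demonstrated offline:
printf 'Download all files that we need to get (28061 MiB).\n' > dryrun.log

# Fish the one interesting line back out of the log:
grep 'Download all files' dryrun.log
```

This directly answers the question of how big the initial update will be, without having to watch the dry-run output fly by.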

    Final Notes.

    If you followed along with this post, then it introduced you to some of the incredibly powerful features that any Linux (or Unix) system provides, such as regular expressions and the “grep” and “awk” commands, and you learned how they can help you select exactly which files you want to download from an online software repository.
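As a tiny recap of how the two tools compose (the file name and the size-plus-path line format are invented for the demonstration), here is a one-liner that totals the sizes of the “main” entries in a toy file list:

```shell
# Fabricated two-line file list, one "size path" pair per line:
printf '1024 pool/main/a/app/app_1.0.orig.tar.gz\n2048 pool/universe/b/lib/lib_2.0.orig.tar.gz\n' > demo_filelist

# grep selects the matching lines; awk sums their first field:
grep 'pool/main/' demo_filelist | awk '{ total += $1 } END { print total }'
```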

    Obviously, much more can be said about these features; if you want to study them further, then the “References” section, below, is an excellent place to start.

    There’s one closely related utility that this post did not discuss: the stream editor, “sed.” In fact, conventional wisdom has it that there’s a natural learning sequence that begins with “grep,” then moves on to “sed,” and only then arrives at “awk.” In the context of this post, however, there was no need to talk about “sed,” and the post is more than long enough already without it. Even so, if you want to become really proficient at the text processing tools available with Unix-like systems, you will certainly encounter the stream editor sooner or later.

    References.

    Last edited by luvr; February 18th, 2010 at 10:08 AM.

  2. #312
    Join Date
    Nov 2005
    Location
    Montreal, Canada
    Beans
    525

    Re: How To: Make Your Own Ubuntu Repository DVDs

    luvr

    I thank you for each and every one of your posts. You've proven your skill and each of your posts is very well thought-out.

    Thank you for adding to the credibility and the strength of this thread.


  3. #313
    Join Date
    Jan 2006
    Location
    Boom, Belgium
    Beans
    222
    Distro
    Ubuntu 10.04 Lucid Lynx

    Re: How To: Make Your Own Ubuntu Repository DVDs

    Heh... I've just noticed that this thread got named Tutorial of the Week on September 7th, 2009.

    And here I was, thinking that this was far too exotic a topic to be appreciated by “the world at large...”

  4. #314
    Join Date
    Nov 2005
    Location
    Montreal, Canada
    Beans
    525

    Re: How To: Make Your Own Ubuntu Repository DVDs

    Quote Originally Posted by luvr View Post
    Heh... I've just noticed that this thread got named Tutorial of the Week on September 7th, 2009.

    And here I was, thinking that this was far too exotic a topic to be appreciated by “the world at large...”
    I didn't know that this thread won! lol I actually checked for a while and ... gave up looking, figuring it would be ... as you said: far too exotic. It's not like we're setting up DVD drivers or some super cool utility.

    Thanks for noting the Tutorial of the Week award. I am very pleased that this collaborative effort was selected. Feel free to continue adding more content, as you have. I'll note your latest addition in the first post.

    Regards

  5. #315
    Join Date
    Apr 2010
    Location
    UK
    Beans
    21
    Distro
    Ubuntu 9.10 Karmic Koala

    Re: How To: Make Your Own Ubuntu Repository DVDs

    Further to the problem raised with karmic

    Sensiva reports that 9.10 Karmic Koala has an issue with this, stemming from a bug in apt:
    I recently asked a question in these forums on this topic and, on advice, submitted a bug report. Only afterwards did I find this thread while searching for something else. Just to let people know: if they purchase the 10-DVD set of Ubuntu 9.10 Karmic, they will also encounter the apt bug whereby you are constantly swapping disks after every single package, instead of apt fetching everything it needs from each disk in turn. So it seems there is no way to get a workable set of DVDs to install Karmic to offline computers. Unless anyone knows of a fix for it?

    Otherwise, this is a useful tut, just not for Karmic.

  6. #316
    Join Date
    Aug 2007
    Location
    Cairo - Egypt
    Beans
    71
    Distro
    Ubuntu 10.04 Lucid Lynx

    Re: How To: Make Your Own Ubuntu Repository DVDs

    Quote Originally Posted by wyrdrat View Post
    So it seems there is no way to get a workable set of DVDs to install Karmic to offline computers. Unless anyone knows of a fix for it?

    Otherwise, this is a useful tut, just not for Karmic.
    Since you have already purchased the DVDs, you have done the hardest part. All you have to do is extract those DVDs into a directory, then run debmirror against that directory; it won't download more than 100 MB of files. Then point apt-get's sources.list to that directory and you are done. Doing this on an external USB drive is a good idea; this is how I do it, btw.

    Good Luck

    /Sensiva>

  7. #317
    Join Date
    Apr 2010
    Location
    UK
    Beans
    21
    Distro
    Ubuntu 9.10 Karmic Koala

    Re: How To: Make Your Own Ubuntu Repository DVDs

    Thanks very much for the helpful advice.

    Do you think they'll bother to fix apt for Karmic?

    Does anyone know if the DVDs or this tutorial work in Lucid yet?

  8. #318
    Join Date
    Oct 2007
    Location
    Australia
    Beans
    1,715
    Distro
    Ubuntu Development Release

    Re: How To: Make Your Own Ubuntu Repository DVDs

    Quote Originally Posted by wyrdrat View Post
    Do you think they'll bother to fix apt for Karmic?
    Not with Lucid, an LTS version, coming out in a week.

    Quote Originally Posted by wyrdrat View Post
    Does anyone know if the DVDs or this tutorial work in Lucid yet?
    I'm sure it will work, but it's not worth your time starting on Lucid just yet, as the repositories are not final until the full release is made. Any updates (and yes, there can be security updates) will make the initial release versions different from what is available now.

  9. #319
    Join Date
    Apr 2010
    Location
    UK
    Beans
    21
    Distro
    Ubuntu 9.10 Karmic Koala

    Re: How To: Make Your Own Ubuntu Repository DVDs

    Following Sensiva's advice this is what I did:

    I created a directory on the hard disk (you will need about 32 GB) called karmicrepository, with subdirectories called main1 - main3, multi1 - multi6, uni1 - uni6, etc. I then copied all the folders from the pool directory of each DVD into those subdirectories, using the cp command in a terminal

    Code:
    cp -R /cdrom/pool/main /karmicrepository/main1
    I found that copying and pasting in Dolphin missed some files. I then created a Packages.gz file by navigating to the karmicrepository directory and typing into a terminal

    Code:
    apt-ftparchive packages ./ | gzip > Packages.gz
    I then opened /etc/apt/sources.list and typed in:

    deb file:/karmicrepository ./

    Then in terminal:

    Code:
    apt-get update
    Now I can use the repository properly.

    I know this is the reverse of this tutorial, but I thought it might help anyone who purchased DVDs for Karmic, is unable to use them because of the bug in apt, and does not have an internet connection on the computer they are installing to. Of course, this doesn't help if you don't have a spare 32 GB on your hard drive ...

    Incidentally, has anyone tried the Lucid DVDs? Has the apt bug been fixed?

  10. #320
    Join Date
    Feb 2008
    Location
    Virginia
    Beans
    74
    Distro
    Ubuntu 14.04 Trusty Tahr

    Re: How To: Make Your Own Ubuntu Repository DVDs

    BOB, why did you put 8.04 in the classic RED? It still has a good year left in the world!
