The Five Best Ideas in Practical Programming
© by Mike Robinson
Digital computers are the ultimate “information-handling power tool.” Today, a compact machine that you can buy for just a few hundred dollars can do things which could not have been done, even for many millions of dollars, just a few years ago. But... how do they do it? What are the techniques of computerized data-processing that “bring home the bacon” today?
Let's take a look ...
#1: Hashing
It is very often the case that the computer needs to “look something up” in its memory, looking for the information by means of a particular key and doing only a minimum amount of “searching” to find it. Hash-tables are exactly what the doctor ordered, and most modern programming languages now have explicit built-in support for it.
Languages which provide hashing also invariably provide other closely-related memory handling tools such as “references,” “reference-counting,” and “garbage-collection.” It's fairly unthinkable to build serious programs without these features, even though they are certainly not new ideas. They first appeared in one of the very first computer programming languages: LISP.
One drawback of hashing is that it's strictly “one way.” A lookup of a key will lead you to a value or to a list-of-values, but if you want to go the other way, you must maintain two or more hashes in parallel to each other. Also, when hashes grow to a certain size they can produce a performance-crippling behavior called thrashing.
#2: Sorting
Those of us who remember (and who still love) those early, campy science-fiction movies remember that “a computer” was a room full of exotic-looking cabinets with blinking lights ... and spinning tapes. Information was taken from the keypunch operators, recorded onto magnetic tapes and frantically processed, just in time to save the Earth from those pesky invading Martians.
Ironically enough, this technique (minus the Martians, the tapes, and the keypunch operators...) is still the best way to process large quantities of information. The secret was ... and still is ... that the information that's being processed has been sorted.
When you know that you're looking at a sorted stream of data ... let's say, “a deck of cards,” ... then you know that you can say some very-useful things about it:
- Turn (only) the first card of the deck face-up. If you're not staring at the Ace of Spades right now, you know that it does not exist anywhere in the deck. Furthermore, you know that every card prior to the one that you do see right now also does not exist, anywhere in the deck.
- When you encounter a gap in the card-sequence, you know that no cards exist anywhere in the deck which would fall into that gap. (If they did exist, they would be “right here.”)
- It's a trivial matter to, say, count the total number of Hearts in a sorted deck. Start with the first Heart you see, and count to the last Heart you see. (If you're looking at a Diamond or a Club, you know there are no Hearts in the deck at all.) If you know that the deck was sorted by Suit and then by card-number within each suit, you know that when you encounter the first Heart there can be none before it, and that when you encounter a different suit, no further Hearts remain.
- If you need to find a starting-point in the deck (such as “the first heart”), you can do this by dividing the deck in half, choosing one of the halves, dividing it in half, and so-on. In this way, you can find your starting-point by looking at no more than six cards.
- If you needed to ensure that no one was “stacking the deck,” detection of duplicates is now a trivial exercise: all of the duplicated occurrences of any value, if they exist, will be on adjacent cards. No “searching” is required to find them.
- Here's the entire, overwhelming advantage in a nutshell: no searching is required!
#3: Regular Expressions
Truly the Swiss Army® Knife of data-processing, regular expressions (or “regexes”) tend to get a bad name because they look like a bunch of chicken-scratches. For example:
m{\(((?>[^()]+)|\([^()]*\))+\)}xBut... if you wanted to “efficiently match a nonempty group with matching parentheses two levels deep or less,” and to extract all occurrences of this patterns from perhaps many million bytes of information, this particular string of chicken-scratches (borrowed from perldoc perlre) will do just that.
Regular expressions are built for pattern matching and string substitution. In other words, they're great when you want to:
- Find records which contain one or more occurrences of some particular pattern.
- Extract the information that does (or doesn't) match a pattern.
- Replace the matched information with something else.
A regular-expression is, in effect, a tiny computer-program expressed in a single string of chicken-scratches. But it's incredibly powerful and efficient. When coupled with higher-level tools like sed, awk, and perl, regular-expressions can be used for a variety of useful tasks, like extracting information from another form of computer-output.
#4: Objects
Most computer software today is built from pre-existing components. The “new” portions of a system are also built as components in the expectation that those components can be re-used.
How do you do that, though? How do you enable two pieces of software that have never known each other to “play well together,” or even coexist? The answer is: objects.
An “object” is an encapsulation of programming and data into one “thing,” having well-defined boundaries. When done properly, this technique allows the developers of different objects to produce useful things without having to be impossibly concerned with “side effects” from other objects which might be sharing the digital swimming-pool with theirs.
#5: TCP/IP (Reliable, Secure Networking)
We would not have the Internet, would not have cellular phones, would not have electronic funds transfers, would not have anything, without the presence of robust, reliable networking such as TCP/IP.
Networking systems allow you to create a reliable one- or two-way connection between two endpoints on a network, and to send units of information reliably between them. The messages, or datagrams, will arrive at their destination, complete and in the sequence they were presented by their sender. Upon this basic foundation, along with agreed-upon (and often, disagreed-on...) protocols, an entire world of reliable data-communications can be built.
Data communications, including so-called Interprocess Communication (IPC) taking place within a single computer or computing cluster, allows the combined efforts of multiple programs and/or computers to be yoked together to the same business task.
These data-communications can also be secure. So-called “tunneling” protocols, such as VPN and SSLv2, allow information to be freely exchanged between sender and receiver as-before ... without express regard for security by either of them ... and to nevertheless be utterly incomprehensible while in-transit. (You see this yourself when you use a “secure” web-page.) Third-parties can neither understand the information that they may intercept, nor alter any of it without detection, nor introduce their own traffic into the stream. Because the tunnel is secure, all the information passing through the tunnel requires no additional layers of security, and the participants can largely ignore its presence.
This information-security is made possible through a rather-magical invention called public-key cryptography, or PKI. This technology allows reliable, verifiable two-way communication between two parties who have never met.