My strategy is going to be easy: pull the string apart into individual characters, analyze each character to identify if it’s an alphanumeric, and if it’s not, convert it into its hexadecimal ASCII equivalent, prefacing it with a “%” as needed.
There are a number of ways to break a string into its individual letters, but let’s use Bash string variable manipulations, recalling that ${#var} returns the number of characters in variable $var, and that ${var:x:1} will return just the letter in $var at position x. Quick now, does indexing start at zero or one?
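To make those two expansions concrete, here is a minimal sketch. The variable name and its value are illustrative only; the point is that offset 0 yields the first character, which answers the indexing question.

```shell
#!/usr/bin/env bash
# Illustrative variable; any string works.
var="linux"

echo "length: ${#var}"             # number of characters in $var
echo "first:  ${var:0:1}"          # offset 0 is the first character
echo "last:   ${var:${#var}-1:1}"  # offset length-1 is the last character
```

Because offsets start at zero, a loop over the string runs from 0 up to, but not including, ${#var}.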
Here’s my initial loop to break $input into its component letters:
input="$*"
echo "$input"
62 | October 2018 | http://www.linuxjournal.com
WORK THE SHELL
for (( counter=0 ; counter < ${#input} ; counter++ )) ; do
  echo "counter = $counter -- ${input:$counter:1}"
done
Recall that $* is a shortcut for everything from the invoking command line other than the command name itself—a lazy way to let users quote the argument or not.
It doesn’t address special characters, but that’s what quotes are for, right?
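A small sketch of what "$*" does with quoted and unquoted arguments may help here. The function name join_args is hypothetical; it simply stands in for the first line of the script.

```shell
#!/usr/bin/env bash
# join_args is a hypothetical helper that mirrors input="$*" from the script.
join_args() {
  local input="$*"     # all arguments joined by single spaces (default IFS)
  printf '%s\n' "$input"
}

join_args "li nux?"    # one quoted argument
join_args li 'nux?'    # two arguments, same resulting string
```

Both calls produce the same string, which is why quoting is optional for simple input. Characters the shell treats specially, such as an unquoted ?, are still interpreted before the script ever sees them, so quotes remain the safer habit.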
Let’s give this fragmentary script a whirl with some input from the command line:
$ sh normalize.sh "li nux?"
li nux?
counter = 0 -- l
counter = 1 -- i
counter = 2 --  
counter = 3 -- n
counter = 4 -- u
counter = 5 -- x
counter = 6 -- ?
There’s obviously some debugging code in the script, but it’s generally a good idea to leave that in until you’re sure it’s working as expected.
Now it’s time to differentiate between characters that are acceptable within a URL and those that are not. Turning a character into a hex sequence is a bit tricky, so I’m using a sequence of fairly obscure commands. Let’s start with just the command line:
$ echo '~' | xxd -ps -c1 | head -1
7e
Now, the question is whether “~” is actually the hex ASCII sequence 7e or not.
A quick glance at http://www.asciitable.com confirms that, yes, 7e is indeed the ASCII for the tilde. Preface that with a percentage sign, and the tough job of conversion is managed.
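As a cross-check that doesn’t need a web browser, Bash’s printf builtin can report the same value. This is an alternative to the xxd pipeline, not the method used in this column, and the helper name percent_hex is made up for illustration. It is only a sketch for single-byte ASCII characters; multibyte characters would need different handling.

```shell
#!/usr/bin/env bash
# percent_hex is a hypothetical helper name, not part of the original script.
percent_hex() {
  # A leading quote in a printf argument makes the argument's value the
  # numeric code of the character that follows it.
  printf '%%%02X\n' "'$1"
}

percent_hex '~'
percent_hex '?'
```

The %% in the format string emits the literal percent sign, and %02X prints the code as two uppercase hex digits.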
But, how do you know what characters can be used as they are? Because of the weird way the ASCII table is organized, that’s going to be three ranges: 0–9 is in one area of the table, then A–Z in a second area and a–z in a third. There’s no way around it, that’s three range tests.
There’s a really cool way to do that in Bash too:
if [[ "$char" =~ [a-z] ]]
What’s happening here is that the =~ operator performs a regular expression match, with the range [a-z] as the pattern. Since the action I want to take after each test is identical, it’s easy now to implement all three tests:
if [[ "$char" =~ [a-z] ]]; then output="$output$char"
elif [[ "$char" =~ [A-Z] ]]; then output="$output$char"
elif [[ "$char" =~ [0-9] ]]; then output="$output$char"
else
As is obvious, the $output string variable will be built up to have the desired value.
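Since all three branches do the same thing, the ranges could also be folded into a single bracket expression. The following is a sketch of that variation, with is_alnum as a hypothetical helper name and a placeholder where the hex conversion would go. Range matching can depend on the locale, so this assumes plain ASCII input.

```shell
#!/usr/bin/env bash
# is_alnum is a hypothetical helper; it succeeds for a single ASCII letter or digit.
is_alnum() {
  [[ "$1" =~ ^[A-Za-z0-9]$ ]]
}

output=""
for char in l i " " n; do
  if is_alnum "$char"; then
    output="$output$char"
  else
    output="$output(encode)"    # placeholder for the hex conversion step
  fi
done
echo "$output"
```

The three separate tests in the column behave the same way; the combined form just trades a little readability for fewer lines.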
What’s left? The hex output for anything that’s not an otherwise acceptable character.
And you’ve already seen how that can be implemented:
hexchar="$(echo "$char" | xxd -ps -c1 | head -1)"
output="$output%$hexchar"
A quick run through:
$ sh normalize.sh "li nux?"
li nux? translates to li%20nux%3f
See the problem? Without converting the hex into uppercase, it’s a bit weird looking.
What’s “nux”? That’s just another step in the subshell invocation:
hexchar="$(echo "$char" | xxd -ps -c1 | head -1 | \
   tr '[a-z]' '[A-Z]')"
And now, with that tweak, the output looks good:
$ sh normalize.sh "li nux?"
li nux? translates to li%20nux%3F
What about a non-ASCII character like an umlaut or an n-tilde? Let’s see what happens:
$ sh normalize.sh "Señor Günter"
Señor Günter translates to Se%C3B1or%200AG%C3BCnter
Ah, there’s a bug in the script when it comes to these two-byte character sequences, because each byte of a special letter should get its own percent-prefixed hex value. The result should be Se%C3%B1or G%C3%BCnter (I restored the space to make it a bit easier to see what I’m talking about).
The script gets the right byte values, but it’s missing a percentage sign: %C3B1 should be %C3%B1, and %C3BC should be %C3%BC.
Undoubtedly, the problem is in the hexchar assignment subshell statement:
hexchar="$(echo "$char" | xxd -ps -c1 | head -1 | \
tr '[a-z]' '[A-Z]')"
Is it the -c1 argument to xxd? Maybe. I’m going to leave identifying and fixing the problem as an exercise for you, dear reader. And while you’re fixing up the script to support two-byte characters, why not replace “%20” with “+” too?
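For readers who want a starting point, here is one possible approach to that exercise, sketched under several assumptions rather than offered as the definitive fix. It swaps xxd for od, forces the C locale so that Bash walks the string one byte at a time, and turns spaces into “+”. The function name urlencode is hypothetical.

```shell
#!/usr/bin/env bash
# urlencode is a hypothetical function name; the approach (byte-wise loop,
# od instead of xxd) is one possible answer, not the column's own solution.
urlencode() (
  LC_ALL=C    # subshell body: the locale change stays inside this function
  input="$1"
  output=""
  # In the C locale, ${#input} and ${input:i:1} count bytes, not characters.
  for (( i = 0; i < ${#input}; i++ )); do
    char="${input:$i:1}"
    if [[ "$char" =~ [A-Za-z0-9] ]]; then
      output="$output$char"
    elif [[ "$char" == " " ]]; then
      output="$output+"    # the suggested "+" in place of %20
    else
      # od prints the byte as two hex digits; tr strips padding and uppercases
      hex="$(printf '%s' "$char" | od -An -tx1 | tr -d ' \n' | tr 'a-f' 'A-F')"
      output="$output%$hex"
    fi
  done
  printf '%s\n' "$output"
)

urlencode "Señor Günter"
```

Because every byte is handled separately, each half of a two-byte character gets its own percent sign. Using printf '%s' instead of echo also avoids the trailing newline, so no head -1 is needed.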
Finally, to make this maximally useful, don’t forget that a number of symbols are also valid within URLs and don’t need to be converted, notably the set “-_./!@#=&?”, so you’ll want to ensure that they don’t get hexified (is that a word?). ◾
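One way to express that extra set is a single pattern held in a variable, sketched below. The names safe_pattern and is_safe are hypothetical, and the character list is simply the set quoted above added to the alphanumerics.

```shell
#!/usr/bin/env bash
# is_safe and safe_pattern are hypothetical names for illustration.
# Keeping the pattern in a variable avoids quoting trouble inside [[ ... ]];
# the hyphen goes last so it is read literally rather than as a range.
safe_pattern='^[A-Za-z0-9_./!@#=&?-]$'

is_safe() {
  [[ "$1" =~ $safe_pattern ]]
}

for char in a 7 - / "?" " " "~"; do
  if is_safe "$char"; then
    echo "'$char' passes through unchanged"
  else
    echo "'$char' still needs hex encoding"
  fi
done
```

A test like this could replace the alphanumeric checks in the main loop, leaving the hex conversion for everything that fails it.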
Send comments or feedback via http://www.linuxjournal.com/contact.
It’s odd that printk() would pose so many problems for kernel development, given that it’s essentially just a replacement for printf() that doesn’t require linking the standard C library into the kernel.
And yet, it’s famously a mess, full of edge cases, corner cases, deadlocks, race conditions and a variety of other tough-to-solve problems. The reason is that, unlike printf(), the printk() function has to produce reasonable behavior even when the entire system is in the midst of crashing. That’s really the whole point: printk() needs to report errors and warnings that can be used to debug whatever strange and unexpected catastrophe has just hit a running system.
Trying to fix all the deadlocks and other problems at the same time would be too large a task for anyone, especially since each one is a special case defined by the particular context in which the printk() call appeared. But, sometimes a bunch of instances in a particular region of code can be addressed all together.
diff -u
Zack Brown is a tech journalist at Linux Journal and Linux Magazine, and is a former author of the “Kernel Traffic” weekly newsletter and the “Learn Plover” stenographic typing tutorials.
He first installed Slackware Linux in 1993 on his 386 with 8 megs of RAM and had his mind permanently blown by the Open Source community.
He is the inventor of the Crumble pure strategy board game, which you can make yourself with a few pieces of cardboard. He also enjoys writing fiction, attempting animation, reforming Labanotation, designing and sewing his own clothes, learning French and spending time with friends’n’family.