Background
I have always been an extreme fan of SGI, ever since I saw the first stereographics demo. I
undertook one course of postgraduate work over others so I could work on the university's big 8-processor
SGI Power Challenge, while my friends fought with the little Cray EL92.
When I had the chance to turn a 4-CPU Onyx2 with a single InfiniteReality3 graphics pipe
into a 24-CPU machine with two graphics pipes, I went for it. I sourced parts from Italy, Sweden,
France and Canada, but most of them came from the east and west coasts of the USA.
http://pymblesoftware.com/onyx2.html
I have collected the aforementioned 24-CPU CrayLinked Onyx2, 8 SGI Indys, a pair of O2s, a pair
of Octanes, an Origin 200 with GigaChannel, three Origin 300s, a pair of Personal IRISes, an Indigo2,
an Indigo, and probably something I am forgetting. Like I said, I am a bit of an SGI fan.
My relationship with the company, however, has always been interesting, to say the least.
The problem
I wanted to link three of the Origin 300s into a single system image. Instead of three separate servers with 4
CPUs and 4 GB of RAM each, they would look like a single system with 12 RISC CPUs
and 12 GB of RAM. This is done with special cables as thick as your arm, with LEDs in the connectors.
I called the sales rep at SGI, the same one I always seem to deal with. “I have three Origin 300s and
I want to NUMAlink them into a single system image,” I say. Several days later comes the answer: if the parts are
available, then maybe they can do it for about $50,000. “Interesting,” I thought. So I got my hands
on an L2 controller and an Origin 3000 series NUMAlink router brick for less than a thousand
dollars. Here is where things start to get interesting. I connected everything up, even though it is an Origin 3000
series router brick and not an Origin 300 NUMAlink module.
So I fire the machine up and immediately get a “Serial number mismatch” error on the L1 LCDs
of every module. Fine. Let's find a way around this.
A bit of investigation
L1001231-001-L2>ver
L2 version: 1.36.0
L1001231-001-L2>serial
L2 system serial number: L1001231.
L1001231-001-L2>
L1001231-001-L2>1.25 l1
entering L1 mode 001r25, to escape to L2
001r25-L1>ver
L1 1.40.4 (Image B), Built 09/29/2005 13:42:59 [Base 1MB image]
001r25-L1>serial all
001r25-L1>
L1001231-001-L2>reboot_l2
will reboot in 5 seconds…
INIT: Switching to runlevel: 6
Sending processes the TERM signal
Restarting
Linux/PPC load:
Uncompressing Linux…done.
Now booting the kernel
Linux version 2.4.7-sgil2 (dsd@tstorm) (gcc version 2.95.2 19991030 (2.95.3 prerelease/franzo)) #1 Mon Feb 28 14:51:03 CST 2005
On node 0 totalpages: 4096
zone(0): 4096 pages.
zone(1): 0 pages.
zone(2): 0 pages.
Kernel command line: root=/dev/ram panic=5
Decrementer Frequency = 187500000/60
Calibrating delay loop… 49.76 BogoMIPS
Memory: 11904k available (952k kernel code, 512k data, 180k init, 0k highmem)
Dentry-cache hash table entries: 2048 (order: 2, 16384 bytes)
Inode-cache hash table entries: 1024 (order: 1, 8192 bytes)
Mount-cache hash table entries: 512 (order: 0, 4096 bytes)
Buffer-cache hash table entries: 1024 (order: 0, 4096 bytes)
Page-cache hash table entries: 4096 (order: 2, 16384 bytes)
POSIX conformance testing by UNIFIX
PCI: Probing PCI hardware
I/O resource not set for host bridge 0
Memory resource not set for host bridge 0
PCI: Cannot allocate resource region 0 of PCI bridge 0
PCI: resource is 80000000..7fffffff (100), parent c011c314
PCI:00:04.0: Resource 0: c0000000-c0000fff (f=200)
PCI:00:05.0: Resource 0: c0001000-c0001fff (f=200)
Linux NET4.0 for Linux 2.4
Based upon Swansea University Computer Society NET3.039
Starting kswapd v1.8
i2c-core.o: i2c core module
i2c-dev.o: i2c /dev entries driver module
i2c-core.o: driver i2c-dev dummy driver registered.
i2c-algo-8xx.o: i2c mpc8xx algorithm module
i2c-rpx.o: i2c RPX Lite/MBX module
i2c-dev.o: Registered ‘rpx’ as minor 0
i2c-core.o: adapter rpx registered as adapter 0.
Console: switching to frame buffer device
fb0: SGI L2 (SED137x LCD controller) frame buffer device
fb0: Display panel [mono]: Hantronix HDM3224 (320x240, 4-bit Greyscale)
CPM UART driver version 0.03
ttyS00 at 0x0000 is a SCC
ttyS01 at 0x0100 is a SCC
ttyS02 at 0x0200 is a SCC
ttyS03 at 0x0300 is a SCC
WDT_8xx: Software Watchdog Timer version 0.3, 30 second timeout
block: queued sectors max/low 7810kB/2603kB, 64 slots per queue
RAMDISK driver initialized: 16 RAM disks of 4096K size 1024 blocksize
eth0: FEC ENET Version 0.1, 08:fec: Phy @ 0x0, type 0x78100003
fec: link down
00:fec: 10 Mbps, Half-Duplex
69:11:b1:77
PowerPC realtime clock driver, version 0.1.
usb.c: registered new driver usbdevfs
usb.c: registered new driver hub
PCI: Enabling device 00:04.0 (0000 -> 0002)
usb-ohci.c: USB OHCI at membase 0xc2002000, IRQ 8
usb-ohci.c: usb-00:04.0, PCI device 11c1:5802 (Lucent Microelectronics)
usb.c: new USB bus registered, assigned bus number 1
Product: USB OHCI Root Hub
SerialNumber: c2002000
hub.c: USB hub found
hub.c: 2 ports detected
PCI: Enabling device 00:05.0 (0000 -> 0002)
usb-ohci.c: USB OHCI at membase 0xc2004000, IRQ 10
usb-ohci.c: usb-00:05.0, PCI device 11c1:5802 (Lucent Microelectronics)
usb.c: new USB bus registered, assigned bus number 2
Product: USB OHCI Root Hub
SerialNumber: c2004000
hub.c: USB hub found
hub.c: 2 ports detected
usb-ohci.c: v5.2:USB OHCI Host Controller Driver
usb.c: registered new driver sgil1
usb.c: registered new driver sgil1
usb.c: registered new driver sgil1
sgil1.c: SGI L1 controller support registered
NET4: Linux TCP/IP 1.0 for NET4.0
IP Protocols: ICMP, UDP, TCP, IGMP
IP: routing cache hash table of 512 buckets, 4Kbytes
TCP: Hash tables configured (established 1024 bind 1024)
NET4: Unix domain sockets 1.0/SMP for Linux NET4.0.
RAMDISK: cramfs filesystem found at block 0
RAMDISK: overriding ramdisk block size to 4096 for cramfs filesystem
RAMDISK: Loading 2012 blocks [1 disk] into ram disk… done.
Freeing initrd memory: 2012k freed
VFS: Mounted root (cramfs filesystem).
Freeing unused kernel memory: 180k init
INIT: version 2.77 booting
cp: /rhosts.allow: No such file or directory
Starting DHCP client daemon….
hub.c: USB new device connect on bus1/1, assigned device number 2
Manufacturer: Silicon Graphics, Inc.
Product: SN1 L1 System Controller
SerialNumber: 00000000
sgil1.c: SGI L1 connected, minor: 64 device: 1.2
hub.c: USB new device connect on bus1/2, assigned device number 3
hub.c: USB hub found
hub.c: 7 ports detected
hub.c: USB new device connect on bus1/2/1, assigned device number 4
Manufacturer: Silicon Graphics, Inc.
Product: SN1 L1 System Controller
SerialNumber: 00000000
sgil1.c: SGI L1 connected, minor: 65 device: 1.4
hub.c: USB new device connect on bus1/2/2, assigned device number 5
usb.c: USB device not accepting new address=5 (error=-110)
hub.c: USB new device connect on bus1/2/2, assigned device number 6
usb.c: USB device not accepting new address=6 (error=-110)
hub.c: USB new device connect on bus1/2/4, assigned device number 7
Manufacturer: Silicon Graphics, Inc.
Product: SN1 L1 System Controller
SerialNumber: 00000000
sgil1.c: SGI L1 connected, minor: 66 device: 1.7
hub.c: USB new device connect on bus1/2/5, assigned device number 8
Manufacturer: Silicon Graphics, Inc.
Product: SN1 L1 System Controller
SerialNumber: 00000000
sgil1.c: SGI L1 connected, minor: 67 device: 1.8
dhcpcd[28]: timed out waiting for a valid DHCP server response
INFO: No DHCP server found, starting local DHCP server (to serve L3 clients).
INFO: DHCP: new IP address is 10.17.177.119
INIT: Entering runlevel: 5
SGI L2 Controller
Current L2 version: 1.36.0 (L2 emulator: 1.36.0)
Flashed L2 version: 1.36.0
L1001231-001-L2>serial all
001c01:
001c02:
001c03:
001r25:
L1001231-001-L2>
L1001231-001-L2> 1.1 l1
entering L1 mode 001c01, to escape to L2
001c01-L1>serial all
001c01-L1>ver
L1 1.30.14 (Image B), Built 08/05/2004 11:09:57 [Base 1MB image]
001c01-L1>
L1001231-001-L2>1.2 l1
entering L1 mode 001c02, to escape to L2
001c02-L1>serial all
Data Location Value
—————————— ———— ——–
Local System Serial Number NVRAM M2001411
Reference System Serial Number Attached L2 L1001231
Local Brick Serial Number EEPROM MLJ194
Reference Brick Serial Number NVRAM MLJ194
001c02-L1>ver
L1 1.12.6 (Image B), Built 04/22/2002 08:13:40 [1MB image]
001c02-L1>
L1001231-001-L2>1.3 l1
entering L1 mode 001c03, to escape to L2
001c03-L1>serial all
001c03-L1>ver
L1 1.8.4 (Image B), Built 10/30/2001 11:47:34 [P1 support]
001c03-L1>
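In the 001c02 output, the brick's local system serial (M2001411, held in NVRAM) differs from the reference serial supplied by the attached L2 (L1001231), which is exactly the mismatch the L1 LCDs complain about. If you have several bricks to check, a small script can flag this. What follows is a minimal Python sketch, assuming the `serial all` table layout shown above (field name, location, then the value as the last token); `parse_serial_all` and `system_serial_mismatch` are hypothetical helpers, not an SGI tool, and nothing here talks to the hardware.

```python
# Hypothetical helper (not an SGI tool): flag a system serial mismatch
# in the text printed by the L1 "serial all" command.

SAMPLE = """\
Data Location Value
Local System Serial Number NVRAM M2001411
Reference System Serial Number Attached L2 L1001231
Local Brick Serial Number EEPROM MLJ194
Reference Brick Serial Number NVRAM MLJ194
"""


def parse_serial_all(text: str) -> dict[str, str]:
    """Map each '... Serial Number' row to its value.

    Assumes every data row ends with the value as its last token, as in the
    001c02 output; the header, the rule line and prompts are skipped.
    """
    fields = {}
    for line in text.splitlines():
        if "Serial Number" not in line:
            continue  # header, separator or prompt line
        name, _, rest = line.partition("Serial Number")
        tokens = rest.split()  # location word(s) followed by the value
        if tokens:
            fields[name.strip() + " Serial Number"] = tokens[-1]
    return fields


def system_serial_mismatch(fields: dict[str, str]) -> bool:
    """True when the brick's own system serial differs from the L2's."""
    local = fields.get("Local System Serial Number")
    reference = fields.get("Reference System Serial Number")
    return local is not None and reference is not None and local != reference


if __name__ == "__main__":
    fields = parse_serial_all(SAMPLE)
    print(fields["Local System Serial Number"], fields["Reference System Serial Number"])
    print("mismatch:", system_serial_mismatch(fields))
```

Note that bricks 001c01 and 001c03 printed nothing for `serial all`, so for them the parser returns an empty mapping and no mismatch is reported; the missing serial is itself worth a look.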
Resolution
You can’t set a serial number with the prefix ‘L’ on an O300.
L2 command processor engaged, for console mode.
L1001231-001-L2>1.1 l1
entering L1 mode 001c01, to escape to L2
serial clear
001c01-L1>serial
BSN: KJD687 SSN: L0000000 Time: 06/04/2009 08:14:15 CDT
001c01-L1>
001c01-L1>
001c01-L1>serial clear
001c01-L1>
001c01-L1>
001c01-L1>001c01 INFO: System serial number reassigned to Mxxxxxx from attached L2.
So I ask around, and get this advice:
You’ll want to get all of your L1s (and PROMs if they’re not) at the same version after you get everything talking; it’ll save you some headaches in the long run. I’m at 1.22.4 on everything except the L2 (which is 1.32.4) and it’s been working flawlessly (and no, it didn’t enable security on my O300s, and I’m still able to do a “serial clear” from the L1 successfully). To be safe though, get them connected together first.
Oh, btw… the easiest way to get the serials on your O300s in sync is to set the L2 to one they can set (prefix ‘M’), then do a ‘serial clear’ on each O300. If everything is wired properly, it should pick up the serial from the L2 controller automagically.
I respond:
DIAG RESULTS:
ALL DIAGS PASSED.
**** End System Configuration and Diagnostics Summary ****
1) Start System
2) Install System Software
3) Run Diagnostics
4) Recover System
5) Enter Command Monitor
Option?
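The serial advice quoted above boils down to two checks: the L2's serial must carry an ‘M’ prefix, and every O300 whose serial differs from it needs a ‘serial clear’ from its L1. A minimal Python sketch of that bookkeeping follows. `usable_on_o300` and `bricks_needing_clear` are hypothetical names, the ‘M’-prefix rule comes only from the advice and the ‘L’-prefix failure reported here, and M0000001 is a made-up serial (the real one is elided in the log). The sketch only decides which bricks to visit; it does not drive the console.

```python
# Hypothetical helper: given the L2's system serial and each brick's current
# local system serial, list the bricks that still need an L1 "serial clear".


def usable_on_o300(serial: str) -> bool:
    # Per the post, an Origin 300 L1 rejects an 'L'-prefixed system serial;
    # the advice was to give the L2 an 'M'-prefixed serial instead.
    return serial.startswith("M")


def bricks_needing_clear(l2_serial: str, bricks: dict[str, str]) -> list[str]:
    """Return L2 brick selectors (e.g. "1.1") whose serial differs from the L2's.

    `bricks` maps a selector, as typed in "1.1 l1", to that brick's current
    local system serial ("" when the L1 reports none). For each returned
    selector you would enter L1 mode and run "serial clear", after which the
    L1 should pick up the serial from the attached L2.
    """
    if not usable_on_o300(l2_serial):
        raise ValueError(f"L2 serial {l2_serial!r} needs an 'M' prefix for O300 bricks")
    return sorted(sel for sel, current in bricks.items() if current != l2_serial)


if __name__ == "__main__":
    # M0000001 is a made-up serial used only for illustration.
    print(bricks_needing_clear("M0000001", {"1.1": "", "1.2": "M2001411", "1.3": ""}))
```

Raising an error for an ‘L’-prefixed L2 serial mirrors the failure seen above: with L1001231 on the L2, clearing the O300 serials cannot converge on a value they will accept.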
Some Explanation
With mismatched serial numbers, the compute modules and the router brick will not fire up
together.
Disconnected (from the L2 and USB), the router brick will start: on its own, it is not mismatched with
anything.
Once the router brick has started, it is then OK (with the older firmware) to reconnect it.
Because the L1s on the Origin 300 compute modules were started without the router present, they
don’t know of each other’s existence.
Resetting all the compute modules causes them to “discover” the R-brick during initialisation, and thus
each other, and become a single system image.